What was done
=============
Improved the script that separates the popularity-contest data into clusters:
- Used sparse matrices to represent the submissions, in this case a
  row-based linked-list sparse matrix (LIL) and a compressed sparse row
  matrix (CSR)
- Changed how package names are identified in each submission: instead of
  reading line by line and taking the third field, a single regex now
  extracts all package names at once, which greatly reduces the runtime
- Refactored the code to remove all uses of numpy.matrix, relying only on
  the sparse matrices
- Used multiple processes to load the popularity-contest submissions and
  to generate the clusters with KMeans
- Previously the script could not run to completion with over 12GB of
  data, because it consumed all the memory of my PC, which has 8GB.
  Now the script runs to completion with 12GB of data, but to make this
  work it must call the Python garbage collector explicitly, which slows
  down reading the submission data
- Changed the output to save to file only the packages of each cluster,
  not the submissions
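
The two sparse formats mentioned above complement each other: LIL is cheap to fill incrementally, CSR is cheap to compute with. A minimal sketch of that build-then-convert pattern, with a hypothetical package vocabulary and submission list:

```python
from scipy.sparse import lil_matrix

# Hypothetical vocabulary mapping package names to column indices.
pkg_index = {"bash": 0, "vim": 1, "gcc": 2}

# Each submission lists the packages it contains.
submissions = [["bash", "vim"], ["gcc"], ["bash", "gcc"]]

# lil_matrix supports efficient incremental, row-wise construction.
matrix = lil_matrix((len(submissions), len(pkg_index)), dtype=int)
for row, pkgs in enumerate(submissions):
    for pkg in pkgs:
        matrix[row, pkg_index[pkg]] = 1

# csr_matrix is the efficient format for the arithmetic done afterwards.
csr = matrix.tocsr()
print(csr.toarray().tolist())  # [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
```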
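
The regex change described above can be sketched as one `re.findall` over a whole submission, replacing a per-line `split()[2]`. The line layout here (`<atime> <ctime> <package> <file>`, package in the third field) is my assumption about the popcon format:

```python
import re

# Hypothetical excerpt of a popularity-contest submission.
data = """\
1500000000 1400000000 bash /bin/bash
1500000000 1400000000 vim /usr/bin/vim
1500000000 1400000000 gcc /usr/bin/gcc
"""

# One findall over the full text extracts every third field at once,
# avoiding a Python-level loop over the lines.
packages = re.findall(r'^\d+ \d+ (\S+)', data, flags=re.MULTILINE)
print(packages)  # ['bash', 'vim', 'gcc']
```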
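
The multi-process loading plus KMeans clustering mentioned above could look roughly like this; the raw data, vocabulary, and `to_row` helper are hypothetical stand-ins for the real parsing code:

```python
import re
from multiprocessing import Pool

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical raw submissions, one string each, lines formatted as
# "<atime> <ctime> <package> <file>".
RAW = [
    "1 2 bash /bin/bash\n1 2 vim /usr/bin/vim",
    "1 2 bash /bin/bash",
    "1 2 gcc /usr/bin/gcc",
    "1 2 gcc /usr/bin/gcc\n1 2 make /usr/bin/make",
]
PKG_INDEX = {"bash": 0, "vim": 1, "gcc": 2, "make": 3}

def to_row(text):
    # Parse one submission into a 0/1 row over the package vocabulary.
    row = np.zeros(len(PKG_INDEX))
    for pkg in re.findall(r'^\d+ \d+ (\S+)', text, flags=re.MULTILINE):
        row[PKG_INDEX[pkg]] = 1
    return row

if __name__ == "__main__":
    # Parse the submissions in parallel, one worker per CPU by default.
    with Pool() as pool:
        X = np.array(pool.map(to_row, RAW))
    # Cluster the parsed submissions.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)
```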