
[Soc-coordination] Weekly 3 - Status Report

What was done

Improved the script that separates the popularity-contest data into clusters:

    - Use sparse matrices to represent the submissions; the script now uses
    a row-based linked list sparse matrix and a compressed sparse row matrix

    - Package names in each submission are now extracted with a regex,
    instead of reading line by line and taking the third field; this
    greatly reduces the runtime

    - Refactored the code to remove all uses of numpy.matrix, so that only
    sparse matrices are used

    - Use multiple processors to load the popularity-contest submissions and
    to generate the clusters with KMeans

    - Previously the script could not run to completion with over 12GB of
    data, because it consumed all the memory of my PC, which has 8GB. The
    script now runs to completion with 12GB of data, but for this to work
    it needs to call the Python garbage collector, which slows down the
    reading of the submissions data

    - Only the packages of each cluster are now saved to the output file; the
    packages of each submission were removed, which reduced the output data
    to 340K

    - Milestone: https://gitlab.com/AppRecommender/AppRecommender/milestones/5
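
To illustrate the regex and sparse matrix items above, here is a minimal
sketch using only the Python standard library. It is not the AppRecommender
code: the names PACKAGE_RE, package_names and build_csr, the sample
submissions, and the exact regex are assumptions made for this example. It
assumes each package line has the form "<atime> <ctime> <package> <path>",
so the package name is the third field.

```python
import re

# Assumed layout of a package line: "<atime> <ctime> <package> <path> [<tag>]".
# Header and footer lines do not start with digits, so they never match.
PACKAGE_RE = re.compile(r"^\d+[ \t]+\d+[ \t]+(\S+)", re.MULTILINE)


def package_names(submission_text):
    """Return the package names of one submission, in file order."""
    return PACKAGE_RE.findall(submission_text)


def build_csr(submissions):
    """Return (data, indices, indptr, column_of) for a binary
    submissions x packages matrix in compressed sparse row form."""
    column_of = {}  # package name -> column index
    data, indices, indptr = [], [], [0]
    for text in submissions:
        columns = set()
        for name in package_names(text):
            if name not in column_of:
                column_of[name] = len(column_of)
            columns.add(column_of[name])
        indices.extend(sorted(columns))
        data.extend([1] * len(columns))
        indptr.append(len(indices))  # position where the next row starts
    return data, indices, indptr, column_of


# Hypothetical sample submissions, made up for this example.
SAMPLE_A = """POPULARITY-CONTEST-0 TIME:1464739200 ID:0001 ARCH:amd64
1464739200 1464652800 bash /bin/bash
1464739200 1464652800 vim /usr/bin/vim
END-POPULARITY-CONTEST-0 TIME:1464739200
"""
SAMPLE_B = """POPULARITY-CONTEST-0 TIME:1464739200 ID:0002 ARCH:amd64
1464739200 1464652800 bash /bin/bash
1464739200 1464652800 git /usr/bin/git
END-POPULARITY-CONTEST-0 TIME:1464739200
"""

data, indices, indptr, column_of = build_csr([SAMPLE_A, SAMPLE_B])
```

Only the non-zero entries are stored, so memory grows with the number of
installed packages per submission and not with submissions x packages. The
three lists have the layout that scipy.sparse.csr_matrix accepts as
csr_matrix((data, indices, indptr), shape=(rows, columns)).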

For next week

- Refactor the code, mainly the part that sets up the multiple processors

- Study more about user data security

- Make the AppRecommender strategies work with the new output data format of
the create clusters script

- Milestone: https://gitlab.com/AppRecommender/AppRecommender/milestones/7
