
[Popcon-developers] data from the past

On Friday, 02.02.2007, at 12:43 +0000, martin f krafft wrote:
> thus spoke Alain Schroeder <alain@debian.org> [2007.01.25.1348 +0000]:
> > You can find the aggregated results in
> > gluck:/org/popcon.debian.org/popcon-mail/all-popcon-results
> > 
> > There is no archive of the raw submission as far as I know.
> Is this something worth starting? It would have really helped me in
> my situation, so maybe it's a good investment into the future?
> I can't imagine it's a load of data with bzip2 compression...

It is currently about 150 MB bzipped and of course growing.

btw: I am building a package suggestion website/service for debian as my
diploma thesis. Getting results takes a lot of time (as in days) and I
still have to do a lot of data preprocessing. I can send you some DB dumps
with first results. I do not know how much I can share yet, because it
is, as I said, part of my diploma thesis.

I am currently experimenting with 3 data sets:

	* including all official packages, without dependencies
	* including all official packages which were used within
	  the last four weeks
	* the above with filtered dependencies

About the privacy issue: I can fully understand Bill. But imo this just
boils down to a security issue, because the LATEST data is already
always on the debian servers. As long as the debian servers are secure,
the data is secure. I think if the archives were filtered to include only
packages that are part of debian, it would be ok. But I doubt that data
older than, e.g., 2 years will really be helpful. Doing the apriori
analysis described above for an Amazon-like suggestion thingy is, btw, not
possible using the currently archived data - you need the raw submissions.
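
To illustrate why the raw submissions matter here: association rules need
to know which packages occur together on the same machine, and the
aggregated per-package counts no longer contain that. Below is a minimal
sketch of an Apriori-style pass over package pairs. It is not the code
used for the thesis; the function name pair_rules, the thresholds and the
toy package lists are assumptions made purely for illustration, and each
submission is assumed to be already reduced to a plain list of package
names.

```python
from collections import Counter
from itertools import combinations


def pair_rules(submissions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules over package pairs.

    `submissions` is an iterable of package-name lists, one per machine
    (hypothetical input shape, not the real popcon submission format).
    """
    baskets = [frozenset(s) for s in submissions]
    n = len(baskets)
    if n == 0:
        return []

    # Pass 1: count single packages and keep only the frequent ones.
    item_counts = Counter(pkg for basket in baskets for pkg in basket)
    frequent = {p for p, c in item_counts.items() if c / n >= min_support}

    # Pass 2: count pairs built only from frequent packages
    # (Apriori pruning: an infrequent package cannot be in a frequent pair).
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent)
        pair_counts.update(combinations(kept, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))

    # Highest confidence first, then alphabetical for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


if __name__ == "__main__":
    toy = [["vim", "git"], ["vim", "git", "mutt"], ["vim"], ["git", "mutt"]]
    for lhs, rhs, support, confidence in pair_rules(toy):
        print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

The per-machine baskets are the only input, which is exactly what the
aggregated results no longer provide.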

Popularity-contest only sends data if the user gives his consent
- unlike e.g. update-notifier, which just loads data from the internet
without asking - what happens to those http logs? I think both
update-notifier and popularity-contest need a privacy statement. I
think privacy would be helped much more if popcon data were
filtered on the user's computer and the result submitted via Tor.

Back to my diploma thesis.

Short overview of my current plans:

	* Getting a webpage with the suggestions online
	* Letting users rate those suggestions
	* Creating a SOAP interface for easy integration into programs,
	  websites, whatever
	* Doing some segmentation analysis on the data
	* Currently far off, but still on my schedule: creating personalized

