
Re: Mining popcon data



Hi,

I guess there is a simple logical demonstration that providing the
two-place counting function F(i,j), which counts the number of
machines that have both package i and package j installed, would open
up the formal possibility of reconstructing data we have promised not
to reconstruct.  But I think that original promise was probably a bit
too strong and should someday be adjusted.  The reasoning is as
follows:

Any DD may upload one new package p.  This package, at first, may have
only one user.  That one user may be easy to guess from a number of
other factors; e.g. perhaps it's the maintainer of p who has p
installed.  In any case, providing the two-place function F(i,j)
allows us to reconstruct exactly which packages the one user of p has
installed, simply by running through all other packages and sampling F
along the row or column p: every package q with F(p,q) = 1 is on that
user's machine.  This argument is almost identical to the argument
against providing arch - package cross products.  It applies equally
well to any two-place function (e.g. A(a,i), counting the number of
installs of package i on arch a), regardless of which other entity it
relates packages to, as far as I can see.
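
The reconstruction argument above can be sketched in a few lines of
Python.  This is only an illustration under assumptions: the machine
list and the helper names (pair_counts, lookup, reconstruct) are made
up for the example and are not part of popcon.

```python
from collections import Counter
from itertools import combinations


def pair_counts(machines):
    """Build F(i, j): the number of machines with both i and j installed."""
    F = Counter()
    for installed in machines:
        # Store each unordered pair once, under a sorted key.
        for i, j in combinations(sorted(installed), 2):
            F[(i, j)] += 1
    return F


def lookup(F, i, j):
    """Read F(i, j) regardless of argument order (0 if never seen together)."""
    return F[tuple(sorted((i, j)))]


def reconstruct(F, p, all_packages):
    """Walk along row p of F and collect every package co-installed with p."""
    return {q for q in all_packages if q != p and lookup(F, p, q) > 0}


# Hypothetical submissions; only the last machine has the new package "p".
machines = [
    {"bash", "vim", "gcc"},
    {"bash", "emacs"},
    {"bash", "vim", "gcc", "p"},
]
all_packages = set().union(*machines)

F = pair_counts(machines)
recovered = reconstruct(F, "p", all_packages)
print(sorted(recovered))
```

With a single user of p, the recovered set is exactly that user's
package list, although the individual submissions were never
published.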

I think our current solution is easily demonstrably "secure" in that
it more or less merges us all into one person.  This is apparent from
the seeming "fact" that we have only the zero-place count M (total
packages) and the one-place function F(i) counting total installs of a
single package.  If this is really all we have, then we cannot do
anything more interesting with it as far as I can tell.  There needs
to be at least one two-place function somewhere to have any chance of
doing any sort of linkage or correlation between packages.  So, it
seems to me we have three options.  We can agree that we will never do
meaningful analysis of packages in a larger context based on
utilization, and instead restrict ourselves to "presumptive" stats
based on Depends, Recommends, Suggests, Build-Depends, etc.  Or we can
change the original popcon promise so that it no longer offers a
formal guarantee of privacy.  A third option is to mark a certain
subset of packages as "sensitive" and simply remove them from the
F(i,j) matrix.
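
To illustrate why the one-place counts alone cannot support any
correlation, here is a minimal sketch with two made-up populations
(population_a and population_b are assumptions for the example).  They
have identical F(i) for every package but different F(i,j), so nothing
computed from F(i) can tell them apart.

```python
from collections import Counter
from itertools import combinations


def single_counts(machines):
    """F(i): the number of machines that have package i installed."""
    return Counter(pkg for installed in machines for pkg in installed)


def pair_counts(machines):
    """F(i, j): the number of machines that have both i and j installed."""
    return Counter(
        pair
        for installed in machines
        for pair in combinations(sorted(installed), 2)
    )


# Two hypothetical populations of two machines each.
population_a = [{"a", "b"}, {"c"}]
population_b = [{"a"}, {"b", "c"}]

# Every package is installed exactly once in both populations ...
same_singles = single_counts(population_a) == single_counts(population_b)
# ... but the co-installation structure differs.
same_pairs = pair_counts(population_a) == pair_counts(population_b)

print(same_singles, same_pairs)
```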

Personally, I am a big-time privacy advocate and have even gone on
YouTube to promote privacy.  But in my opinion, the statistical
information that could be gained far outweighs the minor cost of
imperfect privacy in the case of Debian package statistical analysis.
I have been studying Debian for six years and I still feel like I have
no idea about most packages.  I guess it must be that much more
confusing for the majority of our users, and I would love to make some
nice automatic graphs of how different packages relate according to
usage, bdeps, deps, recs, etc.

Best regards,

Rudi


