
Re: "People who installed X also have packages Y, Z and T installed"



On 2007-03-03, Enrico Zini <enrico@debian.org> wrote:
> So far I've seen two causes for bad suggestions:
>
>  1) Suggestions for a package that is too popular tend to be
>     meaningless: this is because when I query Xapian with, for example,
>     "please give me 20 typical systems that have 'grep' installed", I
>     get random systems as all systems have grep installed.
>     This *might* be detectable looking at the Xapian's relevance
>     estimate, which I'd expect to be low in cases like this.

Is your code available?  I'd like to have a look at what you're doing
and see if I can think of a way to avoid this problem.

What you probably want is to give Xapian some *terms* and ask for other
relevant terms, rather than first having to pick some documents and then
extract relevant terms from them (here a document is a "popcon user" and
a term is an "installed package", if I understand correctly).  I wonder
if there's an easy way to support that.
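
To make the idea concrete, here is a rough sketch in plain Python.  It
does not use Xapian's API at all; related_terms() and the toy data are
made up for illustration.  It scores each candidate package by "lift",
i.e. how much more often it appears on systems with the query packages
than on systems in general.  That is my assumption of a reasonable
measure, not necessarily what Xapian would compute.

```python
from collections import Counter

def related_terms(systems, query, top_n=10):
    """Rank packages that co-occur with every package in `query` by lift.

    `systems` is a list of sets of installed package names (one set per
    popcon submission).  This is a hypothetical helper, not a Xapian call.
    """
    query = set(query)
    total = len(systems)
    overall = Counter()   # number of systems that have each package
    matching = Counter()  # same count, restricted to systems with the query
    n_match = 0
    for pkgs in systems:
        overall.update(pkgs)
        if query <= pkgs:
            n_match += 1
            matching.update(pkgs)
    if n_match == 0:
        return []
    scores = {}
    for pkg, count in matching.items():
        if pkg in query:
            continue
        # lift = P(pkg | query) / P(pkg); 1.0 means "no association"
        scores[pkg] = (count / n_match) / (overall[pkg] / total)
    # Highest lift first; ties broken by name so the order is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy data, invented for the example.
systems = [
    {"grep", "xapian-omega", "libxapian-dev"},
    {"grep", "xapian-omega", "libxapian-dev", "emacs21"},
    {"grep", "emacs21"},
    {"grep", "postfix"},
]
ranked = related_terms(systems, ["xapian-omega"])
```

With a measure like this, a package that is installed everywhere (grep
in the toy data) gets a lift of exactly 1.0 against everything, so it
is at least easy to spot as uninformative.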

Looking through the list, I wonder if packages which aren't installed
widely suffer too.  For example:

Package: xapian-omega
Suggested: libxapian-dev, xapian-tools, emacs21, debsums, apt-show-versions, doxygen, acroread, aircrack-ng, jadetex, gnome-pkg-tools

These aren't terribly good suggestions for things you might also be
interested in.  The first two aren't direct dependencies, but it's not
surprising that they co-occur a lot.  The rest seem a bit random.

Now xapian-omega hasn't been in a stable release, so it's not been
installed by all that many people - popcon says 17.  Perhaps the
small sample size is part of the problem here, as statistical flukes
won't be evened out much.
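
One possible way to damp such flukes, sketched in Python: pull the
observed co-occurrence rate towards the package's overall install rate
with a few "pseudo-systems", so that a score based on 17 installs counts
for less than one based on 1700.  The function name, the prior_weight
value and the counts in the example are all assumptions of mine, not
anything taken from the popcon data.

```python
def shrunk_lift(co_count, n_with_target, base_rate, prior_weight=20.0):
    """Lift with additive smoothing towards the base rate.

    co_count      -- systems that have both the target and the candidate
    n_with_target -- systems that have the target package
    base_rate     -- fraction of all systems that have the candidate
    prior_weight  -- number of imaginary systems at the base rate
                     (an arbitrary tuning knob, assumed here)
    """
    smoothed = (co_count + prior_weight * base_rate) / (n_with_target + prior_weight)
    return smoothed / base_rate

# Made-up numbers: the same raw rate (3/17 == 300/1700) on two sample sizes.
small_sample = shrunk_lift(3, 17, base_rate=0.01)
large_sample = shrunk_lift(300, 1700, base_rate=0.01)
raw_lift = (3 / 17) / 0.01
```

The small sample ends up with a much lower score than the raw lift,
while the large sample stays close to it.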

Another issue is that there may not actually be 10 good suggestions
for some packages.  Perhaps it's better to offer fewer rather than
scrape the barrel to get 10?
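
In code that could be as simple as applying a score cut-off before
truncating the list.  A minimal Python sketch; pick_suggestions(), the
min_score threshold and the scores in the example are invented for
illustration, and a sensible threshold would have to be found by
experiment:

```python
def pick_suggestions(scored, min_score, limit=10):
    """Return at most `limit` package names whose score is >= min_score.

    `scored` is a list of (package, score) pairs in any order.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= min_score][:limit]

# Hypothetical scores, not real output.
scored = [
    ("emacs21", 1.2),
    ("libxapian-dev", 9.0),
    ("debsums", 1.1),
    ("xapian-tools", 7.5),
]
suggestions = pick_suggestions(scored, min_score=2.0)
```

Here only the two packages above the threshold survive, rather than the
list being padded out to 10.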

>> Perhaps you can give some idea on how you would implement a better
>> filtering.

I think you should filter out dependencies - I can't think of any use
of this data for which they'd be worth including.

For example:

Package: g++-3.3
Suggested: gcc-3.3, libstdc++5-3.3-dev, netkit-inetd, ssh, ntp-simple, cpp-3.3, buildd, exim, postfix, libtasn1-2

But g++-3.3 (in unstable at least) has package dependencies on gcc-3.3
and libstdc++5-3.3-dev, and gcc-3.3 depends on cpp-3.3.  So if I have
g++-3.3 installed, I must already have those 3 packages installed too.
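
A sketch of that filter in Python, assuming you have a mapping from each
package to its direct Depends (how you get that is up to you; the map
below contains only the dependencies mentioned above, not the packages'
full Depends fields, and the function names are mine):

```python
def transitive_depends(package, depends):
    """Return every package pulled in by `package`, directly or indirectly."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for dep in depends.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def drop_dependencies(package, suggestions, depends):
    """Remove suggestions that installing `package` already implies."""
    implied = transitive_depends(package, depends)
    return [s for s in suggestions if s not in implied]

# Partial dependency map: just the relationships quoted in this mail.
depends = {
    "g++-3.3": ["gcc-3.3", "libstdc++5-3.3-dev"],
    "gcc-3.3": ["cpp-3.3"],
}
suggested = ["gcc-3.3", "libstdc++5-3.3-dev", "netkit-inetd", "ssh",
             "ntp-simple", "cpp-3.3", "buildd", "exim", "postfix",
             "libtasn1-2"]
filtered = drop_dependencies("g++-3.3", suggested, depends)
```

That leaves netkit-inetd, ssh, ntp-simple, buildd, exim, postfix and
libtasn1-2 for g++-3.3.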

Cheers,
    Olly


