
Why apt-get is not a proper software search engine (was Re: And now for something completely different... etch!)



> > - Better package search mechanism (tags?) allowing free text search
> >   in package management interfaces: "I want a program that does X"
> 
> Doesn't 'apt-cache search X' do exactly that?

[ Here's the in-depth answer from my POV ]

Think of an *end* user who wants to find the most popular multi-user games
in Debian (maybe to play with fellow Debianites). You're saying he has
to:

[ open a terminal in his X session, yes, he could use synaptic but see 
below ]
$ apt-cache search multiuser game
[ shows only a few packages, including libraries, which the user is not 
interested in ]
$ apt-cache search multi-user game
[ different output due to the keyword change ]
$ apt-cache search multi user game
[ mix of the above ]
$ apt-cache search multi player game
[ different packages, shows both library and data packages which he will
never install directly ]

And in all these cases he still wouldn't be able to tell which ones are the
ones other Debian users use most. He would need to feed in the popcon data
for that. Scripting anyone? [1]

What the user in this example really wants to see is:

- end-user packages (not library packages or data stuff pulled in through
dependencies)
- sorted by their popularity (i.e. installed base)
- one-click away from installation
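
A minimal sketch of the first two points, assuming the vote counts are
already loaded into a hash. The name-based filter and the helper names
(looks_end_user, rank_end_user) are hypothetical, and the package names and
numbers are made up for illustration; matching on names is only a rough
approximation of "end-user package".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical heuristic: a package is "not end-user" if its name looks
# like a library, or like a development, data, documentation or debug
# package. This is an assumption, not an existing Debian classification.
sub looks_end_user {
    my ($name) = @_;
    return 0 if $name =~ /^lib/;
    return 0 if $name =~ /-(?:dev|data|common|doc|dbg)$/;
    return 1;
}

# Takes a hash ref of package name => popcon votes and returns the
# end-user package names, most popular first (ties sorted by name).
sub rank_end_user {
    my ($votes) = @_;
    return sort { $votes->{$b} <=> $votes->{$a} or $a cmp $b }
           grep { looks_end_user($_) } keys %$votes;
}

# Made-up packages and vote counts, for illustration only.
my %votes = (
    'somegame'      => 300,
    'somegame-data' => 310,
    'libsomegame1'  => 5000,
    'othergame'     => 900,
);

print "$_\n" for rank_end_user(\%votes);
```

The library and data packages drop out even though they have more votes,
which is the whole point: popularity alone is not enough.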

No package frontend I am aware of can currently pull that stunt. Aptitude
or dselect can only search in the package names ('/' key). Synaptic can
search in the descriptions (with results equivalent to 'apt-cache search').

Moreover, text-based searches in a free text area are not useful when all
words have the same weight (i.e. no keywords). In order to do proper
searches you need automatic language analysis algorithms that add weight
to words (like TF-IDF [2]).
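
To illustrate what weighting buys you, here is a rough TF-IDF sketch. It is
not the dpkg-iasearch code; the function names, the tokenizer and the sample
descriptions are all assumptions made up for this example. Each query word
contributes (occurrences in the description) x log(N / number of
descriptions containing the word), so a word that appears everywhere, like
"game" below, contributes nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split text into lowercase alphanumeric word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

# Build per-document term counts and inverse document frequencies from a
# hash ref of package name => description.
sub build_index {
    my ($docs) = @_;
    my (%tf, %df);
    for my $name (keys %$docs) {
        $tf{$name} ||= {};
        $tf{$name}{$_}++ for tokens($docs->{$name});
        $df{$_}++ for keys %{ $tf{$name} };
    }
    my $n = keys %$docs;
    my %idf = map { $_ => log($n / $df{$_}) } keys %df;
    return (\%tf, \%idf);
}

# Score every document against a free-text query and return the names of
# the matching documents, best score first (ties sorted by name).
sub search {
    my ($docs, $query) = @_;
    my ($tf, $idf) = build_index($docs);
    my %score;
    for my $name (keys %$docs) {
        my $s = 0;
        for my $word (tokens($query)) {
            $s += ($tf->{$name}{$word} || 0) * ($idf->{$word} || 0);
        }
        $score{$name} = $s if $s > 0;
    }
    return sort { $score{$b} <=> $score{$a} or $a cmp $b } keys %score;
}

# Made-up descriptions, for illustration only.
my %docs = (
    'alpha' => 'multiplayer network game',
    'beta'  => 'single player card game',
    'gamma' => 'game data files',
);

print "$_\n" for search(\%docs, 'multiplayer game');
```

Since every sample description contains "game", only "multiplayer" carries
any weight and only the first package is returned. A plain keyword search
treats both words as equally important.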

Consider another example: a user wants to find a good mail reader for his
graphical environment (he's actually looking for an application like
thunderbird or evolution). How should he conduct the search? 'apt-cache
search mail reader'? That's 77 packages he needs to sort out manually.
'apt-cache search mail reader graphic'? That lists only two packages,
neither of which fits his search.

Have I made the issue clear now? With all the software we currently have in
Debian it's *very* difficult for novice users to find what they are looking
for. They end up reading elsewhere (e.g. on Google) about which tools are
good for them and then look for them in Debian, instead of searching first
with the tools they have in Debian.

Finally, if you review the above you'll see that I haven't mentioned i18n
issues, but those are important too. Users can use our system fully
internationalized except for the Debian package management system itself,
which is English-only.

The more software we have, the more difficult it is for users to search 
in it using the current (crude) tools we provide.

Regards

Javier



[1] Attached to this mail is a script that implements this. You can see
that adding popcon data to the search helps but still doesn't cut it, since
it will still show 'library' and 'data' packages which an end user will
rarely install on their own.

[2] I actually implemented this through a hack called 'dpkg-iasearch', which
didn't attract much attention. I didn't have time to work more on it, but
it did allow for free text (non-keyword) searches using TF-IDF to group the
descriptions of packages in clusters.
#!/usr/bin/perl -w
#
# Popular packages search
# (c) 2005 Javier Fernandez-Sanguino
#
# Run an 'apt-cache search' query and order the packages by popularity.
# You first need to retrieve the popcon data, use:
#
# wget -O all-popcon-results.txt.gz http://popcon.debian.org/all-popcon-results.txt.gz
#
# Usage:
#  - popular-packages.pl -p all-popcon-results.txt.gz "my query"
#    Show the packages matching the query, sorted by popularity
#
# --------------------------------------------------------------------------
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#
# You can also find a copy of the GNU General Public License at
# http://www.gnu.org/licenses/licenses.html#TOCLGPL
#
# --------------------------------------------------------------------------

$POPCONURL="http://popcon.debian.org/all-popcon-results.txt.gz";
use Getopt::Std;
use FileHandle;
getopts('hvp:'); 
# opt_h = print help
# opt_p = popularity contest file
# opt_v = verbose (print progress messages to STDERR)

format popularity_top =
Packages sorted by popularity
Name                   Popularity
---------------------------------
.
format popularity =
@<<<<<<<<<<<<<<<<<<<<< @<<<<<<<
$package,              $popularity{$package}
.

if ( $opt_h ) {
	$opt_h = 0; # Shut -w up
	usage();
	exit 0;
}

my $query=shift;
if ( ! defined ($query) || $query eq "" ) {
	print STDERR "$0: Give me something to search for!\n";
	usage();
	exit 1;
}
if ( ! defined ($opt_p) ) {
	print STDERR "$0: You should provide a popularity contest data file!\n";
	usage();
	exit 1;
}

# Use apt-cache search
# Pass the query words as a list so that no shell gets to interpret them
open (QUERY, "-|", "apt-cache", "search", split(' ', $query)) || die ("$0: Cannot run apt-cache: $!");

while (<QUERY>) {

# Here we go....
	chomp;
	print STDERR "\tParsing search result: '".$_."'\n" if $opt_v;
	# Lines look like "package - short description"; keep the whole description
	if ( /^\s*(\S+)\s+-\s+(.*)$/ ) {
		$package = $1;
		$description = $2;
		print STDERR "\tAdding package $package to the list\n" if $opt_v;
		$packagelist{$package} = $description;
		$popularity{$package} = 0;
	}
}
close QUERY;

# Retrieve from
# http://people.debian.org/~apenwarr/popcon/all-popcon-results.txt.gz
if ( $opt_p ) {
	$popularity = $opt_p;
	# '[ -f ... ]' is shell syntax; in Perl it builds an always-true array ref
	( -f $popularity ) || die ("File $popularity does not exist");
	open(POPULAR, "-|", "zcat", "-f", "-c", $popularity) || die ("Cannot uncompress popularity: $!");
	while (<POPULAR>) {
# Format is package #Votes #Old #Recent #Unknown
		chomp;
		# Package names may contain '+' (e.g. g++); require whitespace between fields
		if ( /([\w\-\.\+]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/ ) {
			print STDERR "\tPopularity for $1 is $2\n" if $opt_v;
			$popularity{$1}=$2;
		}
	} 
	close POPULAR;
}

format_name     STDOUT "popularity";
format_top_name STDOUT "popularity_top";
foreach $package ( sort { $popularity{$b} <=> $popularity{$a} } keys %popularity) {
	if ( defined ( $packagelist{$package} ) ) {
		write;
	}
}

exit 0;

sub usage {
	print "Usage: $0 -p popcon_data \"text query\"\n";
	print "\t-p\tPopularity contest data file\n";
	print "\t\tDownload from $POPCONURL\n";
	print "\t-v\tBe more verbose\n";
	print "\t-h\tShow this help\n";
}
