Re: Automated package-popularity survey (was Re: We need easier installation.)

To: matthew@sel.cam.ac.uk
Cc: Debian-Devel <debian-devel@lists.debian.org>
Subject: Re: Automated package-popularity survey (was Re: We need easier installation.)
From: Avery Pennarun <apenwarr@worldvisions.ca>
Date: Sat, 24 Oct 1998 12:57:51 -0400
Message-id: <19981024125751.C2165@worldvisions.ca>
In-reply-to: <Pine.SOL.3.96.981024134503.11096D-100000@ursa.cus.cam.ac.uk>; from M.C. Vernon on Sat, Oct 24, 1998 at 01:49:48PM +0100
References: <19981024023110.C2822@worldvisions.ca> <Pine.SOL.3.96.981024134503.11096D-100000@ursa.cus.cam.ac.uk>

On Sat, Oct 24, 1998 at 01:49:48PM +0100, M.C. Vernon wrote:

> On Sat, 24 Oct 1998, Avery Pennarun wrote:
> 
> > Your installed packages will be listed in order of "most popular" to
> > "least popular." Do the results look reasonable to people?  Is it doing
> > anything blatantly wrong?  Is this whole concept better than nothing, or
> > worse?
>
> the numbers don't look very meaningful (they are very large.....)...it
> might look nicer if you logged them or something ;)

Well, of course the script is only very preliminary.  The output isn't 100%
useful yet :)

The numbers are needed for post-processing the data later (to eliminate
anything that was reinstalled too recently).
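
A rough sketch of what that post-processing might look like, assuming each
output line has the form "atime ctime package path" with both times in
seconds since the epoch (the "%A@ %C@" fields of the script's -printf
format), and assuming a recent change time is a fair sign of a recent
reinstall.  The function name drop_recent and the 30-day cutoff are made up
for illustration; they are not part of the script.

```shell
#!/bin/sh
# drop_recent MIN_AGE_DAYS [NOW]
# Filter survey lines ("atime ctime package path") on stdin, keeping only
# those whose change time (field 2) is at least MIN_AGE_DAYS old.
# NOW defaults to the current time in epoch seconds.
# Hypothetical helper: the cutoff logic is an assumption, not the script's.
drop_recent() (
    min_age=$(( $1 * 86400 ))        # days -> seconds
    now=${2:-$(date +%s)}
    # An awk pattern with no action prints the lines for which it is true.
    awk -v now="$now" -v min_age="$min_age" '(now - $2) >= min_age'
)
```

Something like "./survey-script | drop_recent 30" would then hide packages
whose files changed within the last 30 days (survey-script being whatever
you saved the script as, minus its final "| less").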

> Anyway, my 'top ten' are:
> /bin/cat
> /usr/bin/basename
> /bin/grep
> /usr/bin/find
> /bin/rm
> /usr/bin/less
> /bin/sh
> /usr/bin/pico
> /usr/bin/xemacs-20.4-nomule
> /bin/sash
> 
> Pico probably beats xemacs 'cos a lot of my users like it.

Looks about reasonable.  Of course, the top few are rather skewed since my
script needs them to run.  On the other hand, there would be people taking
pot shots at us if we took cat, basename, and grep out of the standard
distribution anyway :)

To properly analyze the output, you need to look through the whole file from
top to bottom.  There should be some point where you start saying, "Hmm, I
don't use this package much at all." All the packages below that point
should be similarly underused; everything above it, you use once in a while.

Actually I found it a useful way to remind me of a bunch of packages I don't
need :)

If you can say that about the output, then the statistics are useful. If I
get a list like that from lots of people, I can rank the packages by overall
popularity.  (NOTE: don't send output to me!! It's just a test version. 
Please, everyone, _do_ send me a quick note telling me whether the results
you see make any sense.)
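
For the curious, here is one way such a ranking could be computed.  This is
only a sketch under my own assumptions: each report is a file of "atime
ctime package path" lines sorted most-recent-first, and a package gets one
"vote" from a report if it was accessed within some window of that report's
newest access time.  The function name rank_packages, the vote rule, and
the window are all hypothetical; no aggregation method has been decided.

```shell
#!/bin/sh
# rank_packages WINDOW_DAYS REPORT...
# Print "votes installs package", most-voted first.
#   installs = number of reports that list the package
#   votes    = number of reports where it was used within WINDOW_DAYS of
#              that report's newest access time
# Assumes every report is sorted newest-first, so line 1 holds the newest
# access time.  Hypothetical scoring, for illustration only.
rank_packages() (
    window=$(( $1 * 86400 ))         # days -> seconds
    shift
    awk -v window="$window" '
        FNR == 1 { newest = $1 }     # first line of each report file
        {
            installs[$3]++
            if (newest - $1 <= window) votes[$3]++
        }
        END {
            for (pkg in installs)
                printf "%d %d %s\n", votes[pkg], installs[pkg], pkg
        }
    ' "$@" | sort -k1,1nr -k2,2nr -k3,3
)
```

Measuring against each report's own newest access time, rather than the
current date, keeps old reports comparable with fresh ones.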

> BTW  - any hints as to how that shell script works? I can do C but not
> bash... ;(

Hmm... here it is again, for reference.

===

1  #!/bin/bash
2  # Ha ha! I can write ugly code!
3
4  rm -f ignored-pkg
5
6  for d in /var/lib/dpkg/info/*.list; do
7     f=$(basename $d .list)
8     find $(cat $d \
9             | egrep '/bin/|/sbin/|/usr/games|\.[ah]$' \
10            || echo $d >>ignored-pkg) \
11         -follow -prune -type f -printf "%A@ %C@ $f %p\\n" 2>/dev/null \
12       | sort | tail -1
13 done | sort -r | less

===

Lines 1-3 are standard header junk.
4 removes the ignored-pkg file (hey, this is easy so far :))
6 starts a loop through all your dpkg .list files; $d will be assigned to
	each pathname in turn.
7 sets $f to the name of the package, without directory or .list suffix.
8 begins a rather long invocation of the 'find' command.  The $(...) part
	supplies the list of files for find to examine: cat the .list file...
9 ...through egrep, keeping only files in a 'bin', 'sbin', or '/usr/games'
	dir, plus .a files (for -dev packages) and .h files (for headers).
10 If no lines in the file match (egrep returns a failure code), add the
	.list filename to the ignored-pkg file for debugging purposes so I
	can see why it was skipped.
11 More parameters to our 'find' command.  
	-follow (follow symlinks)
	-prune (don't recurse)
	-type f (files only, not dirs; -follow means we include symlinks
		here too)
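	-printf... (output format of each line: access time and change time
		in seconds since the epoch, then package name and file path)
		(hint: try an output format of "%a -- %c -- $f %p" to get
		readable dates instead)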
	2>/dev/null (don't print access errors)
12 Sort the output for each .list file by access time (the first field on
	each line), then take only the bottom line; that's the
	most-recently-accessed useful file from each package.
13 Repeat the loop for each .list file, then sort in reverse
	(most-recently-used-package first) and pipe the output through less.
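
If the one-liner is still too dense, the same steps can be spelled out as
two small functions.  This is a sketch, not a replacement: it swaps find's
-printf for GNU stat (so it assumes GNU coreutils), reports skipped
packages on stderr instead of writing ignored-pkg, and the names
newest_access and survey are mine.

```shell
#!/bin/sh
# Sketch only: a spelled-out take on the same idea as the script above.
# Function names and the stderr logging are illustrative assumptions.

# newest_access LISTFILE
# Print "atime ctime package path" for the most recently accessed
# interesting file named in one dpkg .list file.  Returns status 1 if the
# list names no bin/sbin/games/.a/.h files at all.
newest_access() (
    list=$1
    pkg=$(basename "$list" .list)
    paths=$(grep -E '/bin/|/sbin/|/usr/games|\.[ah]$' "$list") || exit 1
    printf '%s\n' "$paths" | while IFS= read -r path; do
        # -f follows symlinks, much like find's -follow plus -type f.
        [ -f "$path" ] || continue
        # %X = access time, %Z = change time, in epoch seconds (GNU stat).
        printf '%s %s %s\n' "$(stat -L -c '%X %Z' -- "$path")" "$pkg" "$path"
    done | sort -n | tail -n 1
)

# survey [INFODIR]
# Run newest_access over every .list file in INFODIR (default: dpkg's info
# directory) and print the results, most recently used package first.
survey() (
    infodir=${1:-/var/lib/dpkg/info}
    for list in "$infodir"/*.list; do
        [ -e "$list" ] || continue
        newest_access "$list" || echo "ignored: $list" >&2
    done | sort -rn
)
```

Running "survey 2>ignored-pkg | less" should behave much like the original,
with the skipped .list files collected in ignored-pkg.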

Aren't you glad I didn't write it in C? :)

Have fun,

Avery
