Field compression

To: Debian development list <debian-devel@lists.debian.org>
Subject: Field compression
From: Edward Betts <edward@debian.org>
Date: Wed, 19 Jul 2000 00:43:31 +0100
Message-id: <[🔎] 20000719004331.A4160@wb010.wav1.uea.ac.uk>
Mail-followup-to: Debian development list <debian-devel@lists.debian.org>

Forgive me this e-mail will probably be a bit rambling. I had an odd thought
and I thought I would play with it.

I have a few files in this format on my machine:

Field1: foo
Field2: bar

Field1: baz
Field2: bar

Examples would be the available file and my e-mail. I have to download the
package files, which are in this format, from the internet, so it is in my,
and my modems interest to compress them as much as possible.

I was wondering how gzip and bzip2 would compress these, would it be clever
enough to see the repeating fields and just replace them with small codes.

I started to investigate.

I downloaded the source code to gzip and started reading. Sidenote: did you
know that everytime you run gzip it checks if its program name length is
greater than 4, and if so it checks if it ends in ".exe". I think that
personally I would have used an #ifdef, think of all that time it must have
taken for those instructions if you add up all times that gzip has been run.
Anyway, where was I. Oh, yes, I was reading the source code to gzip. Well I
find an explanation of the algorithm used, and it is block based; 32k blocks
if I am correct. So each block will contain a copy of the field names. This
might not be bad on a small file, but my mail gets quite large, and the
available file in my /var/lib/dpkg directory is about 4000kb.

I start experimenting.

I write a simple perl script

#!/usr/bin/perl

@fields = ( "Package", "Priority", "Section", "Installed-Size", "Maintainer", 
	"Architecture", "Version", "Replaces", "Provides", "Depends", 
	"Pre-Depends", "Recommends", "Suggests", "Conflicts", "Filename", 
	"Size", "MD5sum", "Description"
);

while(<>) {
	$a=0;
	foreach $field (@fields) {
		s/^$field: /$a/;
		$a++;
	}
	print;
}

I know that I missed some fields, but this is just an experiment. Once I have
run the available file through my script it goes from looking like this:

Package: esh
Priority: optional
Section: shells
Installed-Size: 296
Maintainer: Edward Betts <edward@debian.org>
Architecture: i386
Version: 0.8-5
Depends: libc6 (>= 2.1.2), libncurses5, libreadline4 (>= 4.1)
Filename: dists/potato/main/binary-i386/shells/esh_0.8-5.deb
Size: 57750
MD5sum: b247614e9fa513b46dd7e9dda623e9a9
Description: the easy shell
 esh was primarily written out of a need for a simple and lightweight shell
 for Unix. As such, it deviates completely from all of the traditional
 shells, opting instead for a Lisp-like syntax. This allows exceptionally
 small size, both in terms of lines of code and memory consumption, while
 retaining remarkable flexibility and programmability.

To looking like this:

0esh
1optional
2shells
3296
4Edward Betts <edward@debian.org>
5i386
60.8-5
9libc6 (>= 2.1.2), libncurses5, libreadline4 (>= 4.1)
14dists/potato/main/binary-i386/shells/esh_0.8-5.deb
1557750
16b247614e9fa513b46dd7e9dda623e9a9
17the easy shell
 esh was primarily written out of a need for a simple and lightweight shell
 for Unix. As such, it deviates completely from all of the traditional
 shells, opting instead for a Lisp-like syntax. This allows exceptionally
 small size, both in terms of lines of code and memory consumption, while
 retaining remarkable flexibility and programmability.

Yes, I know that the numerical fields are messed up. Maybe if I had left the
space in. The uncompressed file is about 600kb small once run through this
perl script. If I compress it the gains are less, but still interesting:

$ ls -l
total 11768
3988 -rw-r--r--    1 edward   edward    4078967 Jul 18 23:23 available
 904 -rw-r--r--    1 edward   edward     920133 Jul 19 00:17 available.bz2
3384 -rw-r--r--    1 edward   edward    3457391 Jul 19 00:14 available.ed
 880 -rw-r--r--    1 edward   edward     896602 Jul 19 00:17 available.ed.bz2
1276 -rw-r--r--    1 edward   edward    1300381 Jul 19 00:15 available.ed.gz
1332 -rw-r--r--    1 edward   edward    1357048 Jul 19 00:11 available.gz
   4 -rwxr--r--    1 edward   edward        382 Jul 19 00:10 compress*
$

55kb shaved off the .gz and 22kb off the .bz2

Now this was just a crude experiment. Imagine the gains in size that could be
made if this was done properly. Have four tables: field names, package names
section names, maintainer names. Don't bother storing them in the compressor,
let it work them out from the file so that it is more generic, and suddenly
you have smaller files.

I understand that this will not be used for the Debian package file, because
we are not even using bzip2 yet. The justification for not using it is that 
bzip2 need about 4Mb of memory to run, and is slower than gzip2. This would be
bad on smaller machines, also another program has to move to main.

I wonder whether my idea is useful for anything? Can anybody see any flaws? Or
are the possible gains to small to worry about?

Small observation - I was just looking at the fields in the package file, I
spotted the `meta-package:' field, one I had not seen before. It only appears
on the netscape packages, how strange.

-- 
Don't worry  --  shop.

Reply to:

Follow-Ups:
- Re: Field compression
  - From: Peter Makholm <peter@makholm.net>
- Re: Field compression
  - From: Robert Bihlmeyer <robbe@orcus.priv.at>

Prev by Date: Re: rman (PolyglotMan) + TkMan are now under the Artistic License.
Next by Date: Re: new fields in debian/control
Previous by thread: Re: Just a question (was: ITP: galeon)
Next by thread: Re: Field compression
Index(es):
- Date
- Thread