Field compression
Forgive me this e-mail will probably be a bit rambling. I had an odd thought
and I thought I would play with it.
I have a few files in this format on my machine:
Field1: foo
Field2: bar
Field1: baz
Field2: bar
Examples would be the available file and my e-mail. I have to download the
package files, which are in this format, from the internet, so it is in my,
and my modems interest to compress them as much as possible.
I was wondering how gzip and bzip2 would compress these, would it be clever
enough to see the repeating fields and just replace them with small codes.
I started to investigate.
I downloaded the source code to gzip and started reading. Sidenote: did you
know that everytime you run gzip it checks if its program name length is
greater than 4, and if so it checks if it ends in ".exe". I think that
personally I would have used an #ifdef, think of all that time it must have
taken for those instructions if you add up all times that gzip has been run.
Anyway, where was I. Oh, yes, I was reading the source code to gzip. Well I
find an explanation of the algorithm used, and it is block based; 32k blocks
if I am correct. So each block will contain a copy of the field names. This
might not be bad on a small file, but my mail gets quite large, and the
available file in my /var/lib/dpkg directory is about 4000kb.
I start experimenting.
I write a simple perl script
#!/usr/bin/perl
@fields = ( "Package", "Priority", "Section", "Installed-Size", "Maintainer",
"Architecture", "Version", "Replaces", "Provides", "Depends",
"Pre-Depends", "Recommends", "Suggests", "Conflicts", "Filename",
"Size", "MD5sum", "Description"
);
while(<>) {
$a=0;
foreach $field (@fields) {
s/^$field: /$a/;
$a++;
}
print;
}
I know that I missed some fields, but this is just an experiment. Once I have
run the available file through my script it goes from looking like this:
Package: esh
Priority: optional
Section: shells
Installed-Size: 296
Maintainer: Edward Betts <edward@debian.org>
Architecture: i386
Version: 0.8-5
Depends: libc6 (>= 2.1.2), libncurses5, libreadline4 (>= 4.1)
Filename: dists/potato/main/binary-i386/shells/esh_0.8-5.deb
Size: 57750
MD5sum: b247614e9fa513b46dd7e9dda623e9a9
Description: the easy shell
esh was primarily written out of a need for a simple and lightweight shell
for Unix. As such, it deviates completely from all of the traditional
shells, opting instead for a Lisp-like syntax. This allows exceptionally
small size, both in terms of lines of code and memory consumption, while
retaining remarkable flexibility and programmability.
To looking like this:
0esh
1optional
2shells
3296
4Edward Betts <edward@debian.org>
5i386
60.8-5
9libc6 (>= 2.1.2), libncurses5, libreadline4 (>= 4.1)
14dists/potato/main/binary-i386/shells/esh_0.8-5.deb
1557750
16b247614e9fa513b46dd7e9dda623e9a9
17the easy shell
esh was primarily written out of a need for a simple and lightweight shell
for Unix. As such, it deviates completely from all of the traditional
shells, opting instead for a Lisp-like syntax. This allows exceptionally
small size, both in terms of lines of code and memory consumption, while
retaining remarkable flexibility and programmability.
Yes, I know that the numerical fields are messed up. Maybe if I had left the
space in. The uncompressed file is about 600kb small once run through this
perl script. If I compress it the gains are less, but still interesting:
$ ls -l
total 11768
3988 -rw-r--r-- 1 edward edward 4078967 Jul 18 23:23 available
904 -rw-r--r-- 1 edward edward 920133 Jul 19 00:17 available.bz2
3384 -rw-r--r-- 1 edward edward 3457391 Jul 19 00:14 available.ed
880 -rw-r--r-- 1 edward edward 896602 Jul 19 00:17 available.ed.bz2
1276 -rw-r--r-- 1 edward edward 1300381 Jul 19 00:15 available.ed.gz
1332 -rw-r--r-- 1 edward edward 1357048 Jul 19 00:11 available.gz
4 -rwxr--r-- 1 edward edward 382 Jul 19 00:10 compress*
$
55kb shaved off the .gz and 22kb off the .bz2
Now this was just a crude experiment. Imagine the gains in size that could be
made if this was done properly. Have four tables: field names, package names
section names, maintainer names. Don't bother storing them in the compressor,
let it work them out from the file so that it is more generic, and suddenly
you have smaller files.
I understand that this will not be used for the Debian package file, because
we are not even using bzip2 yet. The justification for not using it is that
bzip2 need about 4Mb of memory to run, and is slower than gzip2. This would be
bad on smaller machines, also another program has to move to main.
I wonder whether my idea is useful for anything? Can anybody see any flaws? Or
are the possible gains to small to worry about?
Small observation - I was just looking at the fields in the package file, I
spotted the `meta-package:' field, one I had not seen before. It only appears
on the netscape packages, how strange.
--
Don't worry -- shop.
Reply to: