Edward Betts <edward@debian.org> writes:

> Field1: foo
> Field2: bar
>
> Field1: baz
> Field2: bar
[...]
> I was wondering how gzip and bzip2 would compress these, would it be clever
> enough to see the repeating fields and just replace them with small
> codes.

Lempel-Ziv is actually quite good at this. After a while, gzip will express
"Description: " in one symbol of, say, 14 bits (which is better than your
format fares). You're right about the blocks, though.

> To looking like this:
>
> 0esh
> 1optional
> 2shells
> 3296
[...]
> 3988 -rw-r--r-- 1 edward edward 4078967 Jul 18 23:23 available
>  904 -rw-r--r-- 1 edward edward  920133 Jul 19 00:17 available.bz2
> 3384 -rw-r--r-- 1 edward edward 3457391 Jul 19 00:14 available.ed
>  880 -rw-r--r-- 1 edward edward  896602 Jul 19 00:17 available.ed.bz2
> 1276 -rw-r--r-- 1 edward edward 1300381 Jul 19 00:15 available.ed.gz
> 1332 -rw-r--r-- 1 edward edward 1357048 Jul 19 00:11 available.gz
>    4 -rwxr--r-- 1 edward edward     382 Jul 19 00:10 compress*
> $
>
> 55kb shaved off the .gz and 22kb off the .bz2

What you've essentially done is replace the plain-text available file with
a binary format understood only by special tools (no more grep '^Package:').
I don't think the 2 % saving is worth that, especially when the
gzip -> bzip2 move gives an 11 % improvement.

If going binary, you could just as well go the whole way. This will save
much more: numbers like size and MD5sum no longer need to be represented
inefficiently, package references (in Depends-like fields) simply contain an
offset or id instead of the package name, etc.

IMHO a binary format for the available file may not be that bad an idea
(strictly speaking, you're not supposed to mess with it directly, anyway),
but a half-assed solution serves no one.

-- 
Robbe
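
The Lempel-Ziv point above can be checked with a rough sketch like the one below. It uses Python's standard zlib module, which implements the same DEFLATE algorithm as gzip. The records are synthetic and the field values are made up for illustration, so only the trend matters, not the exact numbers: swapping field names for one-character codes saves far fewer bytes after compression than before it.

```python
import zlib

# Hypothetical field names and one-character codes, modelled on the quoted example.
FIELDS = ("Package: ", "Priority: ", "Section: ", "Installed-Size: ")
CODES = ("0", "1", "2", "3")


def make_records(count, keys):
    """Build `count` synthetic records, each value prefixed by the matching key."""
    records = []
    for i in range(count):
        # Made-up values; a real available file is far more varied.
        values = (f"pkg{i}", "optional", "shells", str(100 + (i * 7) % 900))
        records.append("".join(k + v + "\n" for k, v in zip(keys, values)))
    # Records are separated by a blank line, as in the plain-text format.
    return "\n".join(records).encode("ascii")


plain = make_records(2000, FIELDS)
coded = make_records(2000, CODES)

# Bytes saved by the short codes, before and after DEFLATE compression.
raw_saving = len(plain) - len(coded)
gz_saving = len(zlib.compress(plain, 9)) - len(zlib.compress(coded, 9))

print("saved before compression:", raw_saving, "bytes")
print("saved after compression: ", gz_saving, "bytes")
```

On real data the gap will differ, since package descriptions dominate the file and compress much less well than the field names do.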