Re: Field compression



Edward Betts <edward@debian.org> writes:

> Field1: foo
> Field2: bar
> 
> Field1: baz
> Field2: bar
[...]
> I was wondering how gzip and bzip2 would compress these, would it be clever
> enough to see the repeating fields and just replace them with small
> codes.

Lempel-Ziv is actually quite good at this. After a while, gzip will express
"Description: " in one symbol of, say, 14 bits (which is better than
your format fares). You're right about the blocks, though.
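
A rough way to check this is to compress the same records both ways and
compare sizes. The sketch below is only an illustration: the field names,
values and record count are made up rather than taken from the real
available file, the helper names are hypothetical, and zlib stands in for
gzip (same deflate algorithm, without the gzip file header).

```python
import bz2
import random
import zlib

# Hypothetical field names; not taken from the real available file.
FIELDS = ["Package", "Priority", "Section", "Installed-Size"]


def make_records(count, seed=0):
    """Build `count` synthetic records as lists of (field, value) pairs."""
    rng = random.Random(seed)
    records = []
    for i in range(count):
        records.append([
            ("Package", "pkg%d" % i),
            ("Priority", rng.choice(["required", "standard", "optional", "extra"])),
            ("Section", rng.choice(["shells", "utils", "net", "libs", "devel"])),
            ("Installed-Size", str(rng.randint(10, 5000))),
        ])
    return records


def render_plain(records):
    """Render records as 'Field: value' lines, with a blank line between records."""
    blocks = ["".join("%s: %s\n" % (f, v) for f, v in rec) for rec in records]
    return "\n".join(blocks).encode("ascii")


def render_coded(records):
    """Render records with each field name replaced by a one-digit code."""
    blocks = ["".join("%d%s\n" % (FIELDS.index(f), v) for f, v in rec)
              for rec in records]
    return "\n".join(blocks).encode("ascii")


def sizes(data):
    """Return the raw, deflate-compressed and bzip2-compressed sizes in bytes."""
    return {
        "raw": len(data),
        "gz": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    recs = make_records(2000)
    print("plain", sizes(render_plain(recs)))
    print("coded", sizes(render_coded(recs)))
```

On data like this the coded form is much smaller before compression, but
most of that gap should disappear once either compressor has seen the
field names a few times.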

> To looking like this:
> 
> 0esh
> 1optional
> 2shells
> 3296
[...]
> 3988 -rw-r--r--    1 edward   edward    4078967 Jul 18 23:23 available
>  904 -rw-r--r--    1 edward   edward     920133 Jul 19 00:17 available.bz2
> 3384 -rw-r--r--    1 edward   edward    3457391 Jul 19 00:14 available.ed
>  880 -rw-r--r--    1 edward   edward     896602 Jul 19 00:17 available.ed.bz2
> 1276 -rw-r--r--    1 edward   edward    1300381 Jul 19 00:15 available.ed.gz
> 1332 -rw-r--r--    1 edward   edward    1357048 Jul 19 00:11 available.gz
>    4 -rwxr--r--    1 edward   edward        382 Jul 19 00:10 compress*
> $
> 
> 55kb shaved off the .gz and 22kb off the .bz2

What you've essentially done is replaced the plain-text available file
with a binary format understood only by special tools (no more grep
'^Package:'). I don't think the 2 % saving is worth that, especially
when the gzip -> bzip2 move gives 11 % improvement.
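
For anyone who wants to redo the arithmetic, the percentages can be
recomputed from the byte counts in the quoted listing. The base each
figure is measured against is not stated above, so the `base` argument in
this small sketch is an assumption left to the reader; the gzip -> bzip2
figure comes out at roughly 11 % only when taken against the uncompressed
file.

```python
# Byte counts copied from the `ls` listing quoted above.
SIZES = {
    "available": 4078967,
    "available.gz": 1357048,
    "available.bz2": 920133,
    "available.ed": 3457391,
    "available.ed.gz": 1300381,
    "available.ed.bz2": 896602,
}


def saved_bytes(before, after):
    """Bytes saved by going from file `before` to file `after`."""
    return SIZES[before] - SIZES[after]


def saving_percent(before, after, base):
    """The same saving expressed as a percentage of the size of `base`."""
    return 100.0 * saved_bytes(before, after) / SIZES[base]


if __name__ == "__main__":
    for before, after in [
        ("available.gz", "available.ed.gz"),
        ("available.bz2", "available.ed.bz2"),
        ("available.gz", "available.bz2"),
    ]:
        print(
            "%s -> %s: %d bytes, %.1f %% of uncompressed, %.1f %% of %s"
            % (
                before,
                after,
                saved_bytes(before, after),
                saving_percent(before, after, "available"),
                saving_percent(before, after, before),
                before,
            )
        )
```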

If going binary, you could just as well go the whole way - this will
save much more: numbers like size and MD5sum no longer need to be
represented inefficiently, package references (in Depends-like fields)
simply contain an offset or id instead of the package name, etc.
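
To make that concrete, here is a minimal sketch of what one such record
could look like. The layout is entirely hypothetical (it is not an
existing dpkg format), as are the names `HEADER`, `pack_record` and
`unpack_record`: the size becomes a 4-byte integer, the MD5sum 16 raw
bytes instead of 32 hex characters, and each dependency a 4-byte package
id instead of a name.

```python
import struct

# Hypothetical record layout, for illustration only (all big-endian):
#   uint32 installed size, 16 raw MD5 bytes, uint16 dependency count,
#   followed by one uint32 package id per dependency.
HEADER = struct.Struct(">I16sH")


def pack_record(size, md5_hex, dep_ids):
    """Pack one record; `md5_hex` is the usual 32-character hex digest."""
    head = HEADER.pack(size, bytes.fromhex(md5_hex), len(dep_ids))
    return head + struct.pack(">%dI" % len(dep_ids), *dep_ids)


def unpack_record(blob):
    """Inverse of pack_record; returns (size, md5_hex, dep_ids)."""
    size, digest, count = HEADER.unpack_from(blob, 0)
    deps = struct.unpack_from(">%dI" % count, blob, HEADER.size)
    return size, digest.hex(), list(deps)


if __name__ == "__main__":
    # MD5 of the empty string, used here only as a well-formed digest.
    md5 = "d41d8cd98f00b204e9800998ecf8427e"
    blob = pack_record(296, md5, [3, 17])
    print(len(blob), unpack_record(blob))
```

The price is the one already mentioned: nothing in such a file can be
read with grep any more.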

IMHO a binary format for the available file may not be that bad an
idea (strictly speaking, you're not supposed to mess with it directly
anyway), but a half-assed solution serves no one.

-- 
Robbe
