Re: Solving the compression dilemma when rsync-ing Debian versions
On Sun, Jan 14, 2001 at 01:51:02AM +0100, Goswin Brederlow wrote:
> >>>>> " " == Richard Atterer <firstname.lastname@example.org> writes:
> > Just how does it work, pray tell? Is the patch and/or a more
> > detailed description available somewhere?
> From time to time gzip will flush the dictionary and start with a
> clean slate.
> The trick now is to make this happen at special points in the file
> that don't change when the file is altered. To do this the rolling
> checksum algorithm (Adler-32) is computed over a 4K block and, when
> the result is equal to a magic value (0), a flush is forced.
Ah, the magic rolling checksum value is the "missing link"!
But I'm surprised that the value 0, one out of 2^32 possible Adler32
checksum values, appears often enough in typical data to make the
scheme work?! (A uniformly distributed 32-bit value would hit 0 only
once every 4 GB on average.) Seems like Adler32 isn't such a strong
checksum after all. :-/
BTW, 0 is the checksum of an all-zeroes window if the rolling variant
omits Adler32's initial s1 = 1, as rsync's plain rolling sum does
(standard Adler32 of all zeroes is (n << 16) | 1, not 0) - so if the
uncompressed data contains long runs of zero, there will be *lots* of
flushes unless special action is taken.
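One caveat worth spelling out: with the standard Adler32 initialization (s1 = 1), an all-zeroes window does not checksum to 0; that only holds for rolling variants that drop the initial 1, such as rsync-style plain sums. A quick check using Python's zlib (an illustration, not part of the patch):

```python
import zlib

n = 4096
zeros = bytes(n)

# Standard Adler-32 of an all-zero window: s1 stays at its initial 1,
# s2 accumulates n copies of that 1, so the result is (n << 16) | 1, not 0.
assert zlib.adler32(zeros) == (n << 16) | 1

# A plain rolling sum starting at 0 (rsync-style) really is 0 over any
# run of zero bytes, so every zero window would trigger a flush.
assert sum(zeros) % 65521 == 0
```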
> This forced flush happens at random places and not too often
> (increases linux.tar.gz by ~3%).
Am I guessing correctly that the value 0 was only chosen "randomly",
not for any particular reason, and that a zero rolling checksum only
occurs every MB or so?
By altering the window size from the default 4K, you can even get a
smooth trade-off between compression ratio and rsync transfer
volume - nice!
Thanks for the explanation!
|_) /| Richard Atterer | CS student at the Technische
| \/¯| http://atterer.net | Universität München, Germany
¯ ´` ¯