Re: Solving the compression dilemma when rsync-ing Debian versions
On Sun, Jan 14, 2001 at 01:51:02AM +0100, Goswin Brederlow wrote:
> >>>>> " " == Richard Atterer <firstname.lastname@example.org> writes:
> > Just how does it work, pray tell? Is the patch and/or a more
> > detailed description available somewhere?
> From time to time gzip will flush the dictionary and start with a
> clean slate.
> The trick now is to make this happen at special points in the file
> that don't change when the file is altered. To do this the rolling
> checksum algorithm (Adler-32) is computed over a 4K block and, when
> the result is equal to a magic value (0), a flush is forced.
Ah, the magic rolling checksum value is the "missing link"!
But I'm surprised that the value 0, one out of 2^32 possible Adler32
checksum values, appears often enough in typical data to make the
scheme work?! (A uniformly distributed 32-bit value would hit 0 only
once every 4 GB on average.) Seems like Adler32 isn't such a strong
checksum after all. :-/
BTW, 0 is the checksum of an all-zeroes window if the rolling variant
omits Adler32's initial s1 = 1, as rsync's plain rolling sum does
(standard Adler32 of all zeroes is (n << 16) | 1, not 0) - so if the
uncompressed data contains long runs of zero, there will be *lots* of
flushes unless special action is taken.
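One caveat worth spelling out: with the standard Adler32 initialization (s1 = 1), an all-zeroes window does not checksum to 0; that only holds for rolling variants that drop the initial 1, such as rsync-style plain sums. A quick check using Python's zlib (an illustration, not part of the patch):

```python
import zlib

n = 4096
zeros = bytes(n)

# Standard Adler-32 of an all-zero window: s1 stays at its initial 1,
# s2 accumulates n copies of that 1, so the result is (n << 16) | 1, not 0.
assert zlib.adler32(zeros) == (n << 16) | 1

# A plain rolling sum starting at 0 (rsync-style) really is 0 over any
# run of zero bytes, so every zero window would trigger a flush.
assert sum(zeros) % 65521 == 0
```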
> This forced flush happens at random places and not too often
> (increases linux.tar.gz by ~3%).
Am I guessing correctly that the value 0 was only chosen "randomly",
not for any particular reason, and that a zero rolling checksum only
occurs every MB or so?
By altering the window size from the default 4K, you can even get a
smooth trade-off between compression ratio and rsync transfer
volume - nice!
Thanks for the explanation!
|_) /| Richard Atterer | CS student at the Technische
| \/¯| http://atterer.net | Universität München, Germany
¯ ´` ¯