
Re: Solving the compression dilemma when rsync-ing Debian versions



>>>>> " " == Otto Wyss <otto.wyss@bluewin.ch> writes:

    >> > gzip --compress-like=old-foo foo
    >> 
    >> AFAIK that's NOT possible with gzip. Same with bzip2.
    >> 
     > Why not?

gzip builds a dictionary (that gets really large) of strings that are
used and encodes references to them. At the start the dictionary is
empty, so the first character is stored pretty much unencoded and
inserted into the dictionary. The next character is encoded using the
first one, and so on. That way longer and longer strings enter the
dictionary.

Every sequence of bytes creates a unique (maybe not strictly unique,
but pretty much so) dictionary that can be completely reconstructed
from the compressed data. Given the dictionary after the first n
characters, the (n+1)th character can be decoded and the next
dictionary can be calculated.
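
To make that concrete, here is a toy LZW-style sketch in Python of
that growing dictionary. Real gzip uses DEFLATE (a sliding window
plus Huffman tables), so this is only an illustration of the
principle, not of gzip itself: the encoder state after n bytes
depends on all n bytes that came before.

import sys

def lzw_compress(data: bytes):
    # Start with a dictionary of all single bytes.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    output = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate          # keep extending the current match
        else:
            # Emit the code for the longest known string and add the new,
            # one-byte-longer string to the dictionary.
            output.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress(b"abababababab"))
print(lzw_compress(b"xbababababab"))  # one changed byte alters later codes too

The second call differs from the first in a single leading byte, yet
the emitted codes diverge well past that byte, because the dictionary
entries built up along the way are different.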

I think it's pretty much impossible to find two files resulting in
the same dictionary. It certainly is impossible at the speed we need.

You cannot encode two arbitrary files with the same dictionary, not
without adding the dictionary to the file, which gzip does not do
(since that would be a waste of space).

So, as you see, that method is not possible.

But there is a little optimisation that can be used (and is used by
the --rsyncable patch):

If the dictionary gets too big, the compression ratio drops and the
compression becomes ineffective. Then gzip flushes the dictionary and
starts again with an empty one.
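
zlib (the library behind gzip's DEFLATE format) exposes this
flush-and-restart behaviour as Z_FULL_FLUSH. A minimal Python sketch,
using raw deflate streams rather than real gzip files, shows the
property that matters here: after a flush the output no longer
depends on anything that came before.

import zlib

def compress_with_flush(prefix: bytes, suffix: bytes):
    # Raw deflate (wbits=-15): no header or checksum, so the bytes after
    # the flush depend only on the data compressed after the flush.
    comp = zlib.compressobj(wbits=-15)
    head = comp.compress(prefix) + comp.flush(zlib.Z_FULL_FLUSH)  # forget history
    tail = comp.compress(suffix) + comp.flush()
    return head, tail

shared = b"identical data after the flush point\n" * 200
_, tail_a = compress_with_flush(b"one prefix", shared)
_, tail_b = compress_with_flush(b"a completely different prefix", shared)
print(tail_a == tail_b)  # True: same input, same (empty) history after the flush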

The --rsyncable patch now changes the moments when that will
happen. It looks for blocks of bytes that have a certain rolling
checksum, and when the checksum matches it flushes the
dictionary. Two similar files will therefore most likely flush the
dictionary at exactly the same places. If two files are equal after
such a flush, the data will be encoded the same way and rsync can
match those blocks.
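
Something like the following Python sketch, building on the
Z_FULL_FLUSH example above. It is not the actual patch; the window
size and trigger condition are made up for illustration.

import zlib

RSYNC_WINDOW = 4096    # illustrative values only; the real patch may differ
RSYNC_MODULUS = 4096

def rsyncable_compress(data: bytes) -> bytes:
    # Flush the compressor whenever a rolling checksum of the last
    # RSYNC_WINDOW bytes hits a magic value, so flush points depend only
    # on nearby content, not on anything earlier in the file.
    comp = zlib.compressobj(wbits=-15)
    out = []
    rolling = 0
    start = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= RSYNC_WINDOW:
            rolling -= data[i - RSYNC_WINDOW]   # byte leaving the window
        if rolling % RSYNC_MODULUS == 0:
            # Content-defined flush point: emit everything up to here and
            # reset the history, as described above.
            out.append(comp.compress(data[start:i + 1]))
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
            start = i + 1
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)

Because the trigger only looks at the last few kilobytes, two files
that share a long identical stretch hit the same flush points inside
that stretch, and from the first common flush onwards the compressed
bytes are identical, which is what lets rsync match blocks.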

The author claims that rsyncable gzip files take about 0.1-0.2% more
space, which is a price I think everybody is willing to pay.

    >> I wish it were that simple.
    >> 
     > I'm not saying it's simple, I'm saying it's possible. I'm not a
     > compression specialist, but in theory there is nothing that
     > prevents this except the actual implementation.

     > Maybe it's time to design a compression algorithm which has
     > this functionality (same difference rate as the source) from
     > the ground up.

There are such algorithms, but they either always use the same
dictionary or table (like some i386 .exe runtime compressors that are
specialised to the patterns used in opcodes), or they waste space by
adding the dictionary/table to the compressed file. That's a huge
waste with all the small diff files we have.
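
For what it's worth, zlib already offers the first variant as a
preset dictionary that both ends must agree on out of band. A small
Python sketch (the dictionary contents below are a made-up example)
shows the trade-off:

import zlib

# Strings both ends agree on beforehand (made-up example).
shared_dict = b"Package: Version: Architecture: Depends: Description: "

def pack(payload: bytes) -> bytes:
    comp = zlib.compressobj(zdict=shared_dict)
    return comp.compress(payload) + comp.flush()

def unpack(blob: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(blob) + decomp.flush()

msg = b"Package: hello\nVersion: 2.10-3\nArchitecture: amd64\n"
blob = pack(msg)
assert unpack(blob) == msg
print(len(msg), "->", len(blob))  # field names become references into the dictionary

The saving only exists because the receiver already has shared_dict;
ship the dictionary along with every small diff file and the
advantage is gone.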


The --rsyncable patch looks promising for a start and will greatly
reduce the downloads for source mirrors, if it's used.

Regards,
        Goswin


