
Re: Solving the compression dilemma when rsync-ing Debian versions

>>>>> " " == Otto Wyss <otto.wyss@bluewin.ch> writes:

    >> >> > gzip --compress-like=old-foo foo
    >> gzip creates a dictionary (that gets really large) of strings
    >> that are used and encodes references to them. At the start the
    >> dictionary is empty, so the first char is pretty much unencoded
    >> and inserted into the dictionary. The next char is encoded
    >> using the first one and so on. That way longer and longer
    >> strings enter the dictionary.
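To make that description concrete, here is a toy LZW-style sketch in
Python. Note this is a simplification: gzip's deflate format actually
uses LZ77 back-references plus Huffman coding rather than an explicit
growing table, but the "longer and longer strings enter the
dictionary" behaviour is the same idea.

```python
def lzw_compress(data: bytes) -> list:
    """Toy LZW: the table starts (almost) empty and grows as longer
    and longer strings are seen, as described above."""
    # Seed with all single bytes; the first input byte is therefore
    # emitted essentially unencoded.
    table = {bytes([i]): i for i in range(256)}
    out = []
    current = b""
    for b in data:
        candidate = current + bytes([b])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            out.append(table[current])     # emit reference to a known string
            table[candidate] = len(table)  # a longer string enters the table
            current = bytes([b])
    if current:
        out.append(table[current])
    return out
```

On repetitive input the code stream shrinks quickly: compressing the
eight bytes b"abababab" takes only five codes.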
     > ...
    >>  So, as you see, that method is not possible.
     > Okay, let's assume gzip somehow knows the table from the
     > previous compression. It starts compressing as usual, getting
     > the first value to encode. Now instead of just putting it at
     > the first position, it looks in the old table and finds it at
     > position 3. Gzip puts it at position 3, leaving the first 2
     > unused. Now everything goes fine until a value isn't found.
     > This time gzip appends it to the end of the table. Of course,
     > at a certain point these two tables diverge too much, so gzip
     > starts using a new table.

First, it's not possible.

Second, even assuming you could do it, you would need the old table
for unpacking, so without the old file the new one is just garbage.

The dictionary gzip uses stores each entry as a reference to a
previous entry plus another character. So the third entry will
reference the first or second. Without knowing the first or second,
the third has no real meaning.

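That chaining can be shown in a few lines (LZ78-style entries, each
one an earlier index plus one byte; the helper names are made up for
illustration):

```python
# Each table entry is (index of an earlier entry, one extra byte);
# entry 0 is the empty string. This is the chaining described above.
table = [(None, b"")]

def add(prev: int, byte: bytes) -> int:
    table.append((prev, byte))
    return len(table) - 1

def expand(i: int) -> bytes:
    """Expanding entry i walks the whole chain back to entry 0."""
    prev, byte = table[i]
    return (b"" if prev is None else expand(prev)) + byte

a   = add(0, b"a")   # entry 1 = "" + "a"
ab  = add(a, b"b")   # entry 2 = entry 1 + "b"
aba = add(ab, b"a")  # entry 3 = entry 2 + "a"

# expand(aba) follows the chain 3 -> 2 -> 1 -> 0; lose entry 1 or 2
# and entry 3 can no longer be reconstructed.
```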
     > I don't know the outcome of such a compression, but I think it
     > will correspond much better to the sources. Besides, this
     > algorithm could be used as gzip --compress=i386.gz

     > where i386.gz contains an optimal table for i386 binaries. It
     > will give a better starting compression rate while retaining an
     > adaptive compression, and it allows specifying the optimal
     > compression for any case.

If I'm not mistaken, rar does this.

     > I don't think mine is the best solution, and I don't know if
     > it's working; it just gives an idea of how the problem could be
     > solved. The principle is to use the old compression scheme and
     > change it as little as possible but as much as necessary.

The real solution, and the only workable one, is to teach rsync about
compression. You won't get people to change to a new compressor; just
look how long bzip2 has been around, which compresses far better than
gzip. rsync, on the other hand, only needs an update on the client
side (I still hope to make it work that way), so as soon as you
install a then-current rsync, all existing servers would give you the
improved transfers.
    >> But there is a little optimisation that can be used (and is
    >> used by the --rsyncable patch):
     > This is of course a very good solution; I only wouldn't call it
     > --rsyncable. I wouldn't make it an option at all. Anyhow, it's
     > not the ultimate solution; there are cases where it will
     > fail.

No, not really. Of course it's all based on probabilities, but after
a few blocks the probability of failing to resync is essentially zero.
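The effect can be imitated with zlib's full-flush points: a full
flush empties the compressor's history window, so identical input
after the flush compresses to identical bytes and rsync can find
matching blocks again. A rough sketch (flushing at fixed offsets
here; the --rsyncable patch instead picks boundaries with a rolling
checksum over the plaintext):

```python
import zlib

def rsyncable_compress(data: bytes, block: int = 8192) -> bytes:
    """Compress, restarting the compressor's history every `block`
    bytes so a change only perturbs the block it falls in."""
    comp = zlib.compressobj()
    out = []
    for i in range(0, len(data), block):
        out.append(comp.compress(data[i:i + block]))
        # Z_FULL_FLUSH discards the LZ77 window: compression of the
        # next block no longer depends on anything before this point.
        out.append(comp.flush(zlib.Z_FULL_FLUSH))
    out.append(comp.flush())
    return b"".join(out)
```

The output is still one valid zlib stream; the price is a slightly
worse compression ratio, since each block starts with an empty window.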

    >> > Maybe it's time to design a compression algorithm which has
    >> > this functionality (same difference rate as the source) from
    >> > the ground up.
    >> There are such algorithms, but they either always use the same
    >> dictionary or table (like some i386.exe runtime compressors
    >> that are specialised to the patterns used in opcodes), or they
    >> waste space by adding the dictionary/table to the compressed
    >> file. That's a huge waste with all the small diff files we have.
     > These all have fixed compression; as far as I know there isn't
     > any which combines a fixed with an adaptive compression.

Yes, because having an adaptive algorithm contradicts the idea of
storing a fixed table in a file, and if the table is fixed for a lot
of files, just let the compressor know the table.
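zlib actually exposes exactly this as a preset dictionary: both sides
agree on the table out of band, and it is never stored in the stream.
A minimal sketch (the contents of `SHARED` here are a made-up
stand-in for a real per-architecture table):

```python
import zlib

# A hypothetical shared table: strings expected to be frequent in the
# files being compressed. Both ends must hold an identical copy.
SHARED = b"int char return static void struct unsigned const sizeof "

def compress_with_dict(data: bytes) -> bytes:
    c = zlib.compressobj(zdict=SHARED)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    # Decompression fails without the very same dictionary.
    d = zlib.decompressobj(zdict=SHARED)
    return d.decompress(blob) + d.flush()
```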

