
Re: Solving the compression dilemma when rsync-ing Debian versions



>    >> > gzip --compress-like=old-foo foo
>
>gzip creates a dictionary (that gets really large) of strings that are
>used and encodes references to them. At the start the dictionary is
>empty, so the first char is pretty much unencoded and inserted into
>the dictionary. The next char is encoded using the first one and so
>on. That way longer and longer strings enter the dictionary.
>
...
>
>So, as you see, that method is not possible.
>
Okay, let's assume gzip somehow knows the table of the previous
compression. It starts compressing as usual, getting the first value to
encode. Now, instead of just putting it at the first position, it looks
in the old table and finds it at, say, position 3. Gzip puts it at
position 3, leaving the first two unused. Everything goes fine until a
value isn't found in the old table; this time gzip appends it to the end
of the table. Of course, at a certain point the two tables diverge too
much, so gzip starts using a new table.
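
To make the idea concrete, here is a rough sketch in Python using a toy
LZW-style coder. It is only an illustration: real gzip uses DEFLATE (a
sliding window plus Huffman coding), and lzw_encode and its parameters
are made up for this example. A real implementation would also fall
back to a fresh table once the tables diverge too much, as described
above.

    def lzw_encode(data, old_table=None, max_size=4096):
        # Start from the previous run's table, so strings already seen
        # there keep their old code numbers; unseen strings are
        # appended at the end.
        if old_table is not None:
            table = dict(old_table)  # assumed to be the table returned
                                     # by a previous lzw_encode call
        else:
            table = {bytes([i]): i for i in range(256)}
        next_code = max(table.values()) + 1
        codes = []
        w = b""
        for byte in data:
            wc = w + bytes([byte])
            if wc in table:
                w = wc
            else:
                codes.append(table[w])     # emit the (possibly old) code
                if next_code < max_size:
                    table[wc] = next_code  # append new entries at the end
                    next_code += 1
                w = bytes([byte])
        if w:
            codes.append(table[w])
        return codes, table

Seeding the second run with the first run's table then looks like:

    with open("foo.old", "rb") as f:
        old_codes, old_table = lzw_encode(f.read())
    with open("foo", "rb") as f:
        new_codes, _ = lzw_encode(f.read(), old_table=old_table)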

I don't know the outcome of such a compression, but I think it would
correspond much better to the sources. Besides, this algorithm could be
used as
	gzip --compress=i386.gz

where i386.gz contains an optimal table for i386 binaries. It would give
a better initial compression rate while retaining adaptive compression,
and it would allow specifying the optimal table for any use case.
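
Incidentally, zlib already implements something very close to this: a
preset dictionary that seeds the compressor's sliding window before
adaptive compression begins (deflateSetDictionary in the C API, zdict
in Python's bindings). The gzip *file* format has no field for a
dictionary ID, which is presumably why the gzip tool never grew such an
option, but the mechanism itself is a rough sketch away; the i386
dictionary contents below are of course a placeholder:

    import zlib

    # Placeholder for byte sequences common in i386 binaries; zlib
    # simply treats the dictionary as recent history to match against.
    I386_DICT = b"\x55\x89\xe5\x83\xec" * 64

    def compress_with_preset(data, preset=I386_DICT):
        # The preset dictionary fills the sliding window, so matches
        # are found from the first byte on; compression then adapts
        # as usual.
        co = zlib.compressobj(9, zdict=preset)
        return co.compress(data) + co.flush()

    def decompress_with_preset(blob, preset=I386_DICT):
        # The decompressor must be handed the very same dictionary.
        do = zlib.decompressobj(zdict=preset)
        return do.decompress(blob) + do.flush()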

I don't think mine is the best solution, and I don't know whether it
works; it just gives an idea of how the problem could be solved. The
principle is to reuse the old compression state, adapting it as little
as possible but as much as necessary.

>But there is a little optimisation that can be used (and is used by
>the --rsyncable patch):
>
This is of course a very good solution; I only wouldn't call it
--rsyncable, and I wouldn't make it an option at all. Still, it's not
the non-plus-ultra solution; there are cases where it will fail.
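
For readers who haven't seen the patch: the trick is to flush and reset
the compressor at boundaries chosen by a rolling checksum of the input,
so an insertion or deletion only perturbs the compressed blocks it
actually touches. A rough sketch of the mechanism in Python; the window
size and trigger are assumptions, not the patch's actual values:

    import zlib

    WINDOW = 4096  # rolling-sum window (assumed value)
    MASK = 4095    # trigger: sum over the window hits 0 mod 4096

    def rsyncable_compress(data):
        # Full flushes at content-defined boundaries make the blocks
        # between them independent, so unchanged regions far from an
        # edit compress to identical bytes, which rsync can then match.
        co = zlib.compressobj(9)
        out, rolling, start = [], 0, 0
        for i, b in enumerate(data):
            rolling += b
            if i >= WINDOW:
                rolling -= data[i - WINDOW]
            if (rolling & MASK) == 0:
                out.append(co.compress(data[start:i + 1]))
                out.append(co.flush(zlib.Z_FULL_FLUSH))  # drop history
                start = i + 1
        out.append(co.compress(data[start:]))
        out.append(co.flush())
        return b"".join(out)

The cost is real but small: every full flush spends a few bytes and
discards the match history, so the compression ratio drops slightly.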

>     > Maybe it's time to design a compression algorithm which has
>     > this functionality (same difference rate as the source) from
>     > the ground up.
>
>There are such algorithms, but they either always use the same
>dictionary or table (like some i386 .exe runtime compressors that are
>specialised to the patterns used in opcodes) or they waste space by
>adding the dictionary/table to the compressed file. That's a huge waste
>with all the small diff files we have.
>
These all use fixed compression; as far as I know, there isn't one that
combines a fixed with an adaptive compression.

O. Wyss


