
Re: Solving the compression dilemma when rsync-ing Debian versions



>>>>> " " == Otto Wyss <otto.wyss@bluewin.ch> writes:

     > It's commonly agreed that compression prevents rsync from
     > profiting from older versions of packages when synchronizing
     > Debian mirrors. All the discussion about fixing rsync to solve
     > this, even through a deb-plugin, is IMHO not the right way.
     > Rsync's task is to synchronize files without knowing what's
     > inside them.

     > So why not solve the compression problem at the root? Why not
     > change the compression in such a way that it produces a
     > compressed result with the same (or a similar) difference rate
     > as the source?

     > As my understanding of compression goes, all have a kind of
     > lookup table at the beginning where all the compression codes
     > are declared. Each time this table is created anew, each time
     > slightly different from the previous one, depending on the
Nope. Only a few compression programs store a table at the start of
the file. Most build the table as they go along; not having to store
the table saves quite a bit of space.

gzip (I hope I remember that correctly), for example, extends its table
with every character it encodes, so when you compress a file that
contains only zeros, the table will not contain any a's, so an 'a'
can't even be encoded.
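
A quick way to see this adaptive behaviour is to compress two inputs
that differ in a single byte and compare the outputs. A rough sketch
using Python's zlib, which implements the same DEFLATE scheme gzip
uses; the sample data is made up:

    import zlib

    # Two inputs that differ in exactly one byte near the start.
    old = b"Package: foo\nVersion: 1.0\nDepends: bar\n" * 500
    new = b"X" + old[1:]

    c_old = zlib.compress(old, 9)
    c_new = zlib.compress(new, 9)

    # How many compressed bytes are still identical at the same offsets?
    same = sum(a == b for a, b in zip(c_old, c_new))
    print(len(c_old), len(c_new), same)

Typically very little of the compressed stream survives the change,
because the codes assigned later depend on everything seen before.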

bzip2, on the other hand, re-sorts the input to get better compression
ratios. You can't re-sort different data in the same way; the
compression ratio would drop dramatically if you tried.
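
The re-sorting bzip2 does is the Burrows-Wheeler transform. A toy
version (just the transform, without the move-to-front and Huffman
coding bzip2 adds on top) shows why one changed character rearranges
output far away from the change:

    def bwt(s):
        """Naive Burrows-Wheeler transform: sort all rotations of the
        input and keep the last column. The sort order depends on the
        whole block."""
        s = s + "\0"                      # unique end-of-block marker
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    print(bwt("abracadabra"))
    print(bwt("Xbracadabra"))             # one character changed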

ppm, as a third example, builds a new table for every character that is
transferred and encodes the probability range of the actual character
in one of the current contexts. The contexts are based on all the
preceding characters. The first character is sent as plain text, and
the rest of the file will (most likely) differ if that character
changes.
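
In sketch form, the adaptive part looks roughly like this toy order-2
context model; real ppm adds escape symbols and an arithmetic coder,
both left out here:

    from collections import defaultdict, Counter

    def context_probabilities(text, order=2):
        """For each character, the probability the model would assign
        to it, based only on counts gathered from the text seen so far."""
        counts = defaultdict(Counter)        # context -> character counts
        probs = []
        for i, ch in enumerate(text):
            ctx = text[max(0, i - order):i]  # the preceding characters
            seen = counts[ctx]
            total = sum(seen.values())
            probs.append(seen[ch] / total if total else 0.0)
            seen[ch] += 1                    # adapt after predicting
        return probs

    print(context_probabilities("abababab"))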

     > source. So getting similar results when compressing means using
     > the same, or at least an equivalent, lookup table. If it were
     > possible to feed the lookup table of the previous compressed
     > file into the new compression process, an equal or at least
     > similar compression could be achieved.
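
Something loosely in this spirit already exists in zlib (a library
that implements the same DEFLATE compression gzip uses): a preset
dictionary, which seeds the compressor's match window with known data;
the decompressor has to be given the same dictionary. A rough sketch
with Python's zlib bindings and made-up data:

    import zlib

    old_version = b"Package: foo\nVersion: 1.0\nDepends: bar\n" * 500
    new_version = old_version.replace(b"1.0", b"1.1")

    zdict = old_version[-32768:]          # at most one 32 KB window

    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                            9, zlib.Z_DEFAULT_STRATEGY, zdict)
    packed = comp.compress(new_version) + comp.flush()

    # The receiver needs the same dictionary to unpack the data.
    decomp = zlib.decompressobj(zlib.MAX_WBITS, zdict)
    assert decomp.decompress(packed) == new_version

    # Compare against compressing without the dictionary.
    print(len(packed), len(zlib.compress(new_version, 9)))

Note that this seeds the LZ77 window, not a code table, and it only
reaches back 32 KB, so it is not a complete answer for whole packages.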

     > Of course, always using the same lookup table means a decrease
     > in the compression rate. If there is an algorithm which compares
     > the old rate with an optimal rate, even this could be solved.
     > This means a completely different compression from time to time.
     > It all depends on how easily an equivalent lookup table could be
     > created without losing too much of the compression rate.

Knowing the structure of the data can greatly increase the compression
ratio. Knowing the structure can also greatly reduce the amount of data
that has to be transferred to sync two files.
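
As a rough illustration of the second point: split two versions of a
file into blocks and count how many blocks the receiver could reuse,
once on the plain data and once on the compressed data. Real rsync
uses a rolling checksum and also finds shifted matches; this sketch
only looks at aligned blocks, and the data is made up:

    import hashlib, zlib

    # Stand-ins for two versions of the same (uncompressed) file.
    old = b"Package: foo\nVersion: 1.0\nDepends: bar\n" * 500
    new = old.replace(b"1.0", b"1.1", 1)        # one small change

    def block_hashes(data, size=1024):
        return {hashlib.md5(data[i:i + size]).digest()
                for i in range(0, len(data), size)}

    def reusable_blocks(a, b):
        # How many blocks of b the holder of a already has.
        return len(block_hashes(a) & block_hashes(b))

    print(reusable_blocks(old, new))
    print(reusable_blocks(zlib.compress(old, 9),
                          zlib.compress(new, 9)))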

So why should rsync stay stupid?

Regards,
        Goswin


