
Re: Debdiff project announcement.



On Fri, Feb 04, 2000 at 10:43:18PM -0000, Tom Rothamel wrote:
> In tom.lists.debian-devel, you wrote:
> > wouldn't it be easier to create an rsync method for apt?
> 
> I already looked at this, as well as xdelta. The problem is that a
> small change in a file leads to a big change in a .tar.gz containing
> that file, making this less than optimal.

   Note that recompiling a binary will result in a new timestamp even if the
binary is identical, and a new version requires a change in changelog.gz, so
small changes near the beginning of data.tar.gz are unavoidable.

   At one extreme we could turn off compression in the .deb and rely on the
transfer method to provide compression.  This extreme is pretty easy to
quantify: I see about 50% savings between libc6-2.1.2-11_i386 and 2.1.2-12.
binary-all and source packages generally show much greater savings.
There might even be significant savings between architectures
(all the binary-all data that ends up in binary-arch .debs).  A heavily
customized rsyncd/rsync set could recognize .debs and unpack/repack them
before transfer.  This would give the highest bandwidth savings but
would often change the md5sums.  While that might be acceptable to most end
users, it would be disturbing on a mirror.
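
   For reference, the unpack side of that idea is small; here is a sketch
only (a .deb is an ar archive holding debian-binary, control.tar.gz and
data.tar.gz, and the plain ar/gunzip calls below are illustrative, not a
proposed interface):

#include <stdio.h>
#include <stdlib.h>

/* Extract a .deb into the current directory and decompress its members
 * so rsync's rolling checksum can match unchanged files across package
 * versions.  Repacking on the far side would have to reproduce the
 * original member order, modes and timestamps exactly, which is why the
 * md5sum is generally lost. */
static int run(const char *cmd)
{
    int rc = system(cmd);
    if (rc != 0)
        fprintf(stderr, "command failed (%d): %s\n", rc, cmd);
    return rc;
}

int main(int argc, char **argv)
{
    char cmd[1024];

    if (argc != 2) {
        fprintf(stderr, "usage: %s package.deb\n", argv[0]);
        return 1;
    }
    /* a .deb is an ar archive: debian-binary, control.tar.gz, data.tar.gz */
    snprintf(cmd, sizeof cmd, "ar x '%s'", argv[1]);
    if (run(cmd) != 0)
        return 1;
    /* store the tarballs uncompressed for the transfer */
    return run("gunzip -f control.tar.gz data.tar.gz") != 0;
}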

   At the other extreme we could bzip2 instead of gzip, losing small-memory
machines and gaining nothing for small changes (which is the current situation
anyway).

   Andrew Tridgell (rsync) came up with the following strategy about a year
ago but has been working on other projects and is hoping someone else
implements it inside zlib.
   At a cost of a few % compression efficiency, gzip could be modified to
flush the dictionary more frequently than the current 32kB based on the
value of a rolling checksum.  This would cause the following compressed data
to have identical values, up until the next change (or a bit earlier
assuming the dictionary changes as well).  This wouldn't be as rsyncable
as raw data (the best), and wouldn't be as small as the current gzip'd data.
It's a compromise.  The advantage of this strategy is that the resulting
.deb keeps the same md5sum and there is no need to retain old .debs or diffs
on the server.  The disadvantage is the lost compression efficiency.  The trade-off is
adjustable depending on block length, with no compression and no incremental
updates at either extreme.  I'm playing around with the rolling checksums
and block lengths but have nothing to report yet.
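
   A rough sketch of the flush idea (built on top of zlib's public deflate()
API rather than inside zlib as Tridgell suggests, with arbitrary window and
trigger values, so take it as illustration only): a simple additive rolling
checksum over the last few kB triggers a Z_FULL_FLUSH, which empties the
dictionary so that unchanged input after that point compresses to the same
bytes again.  The window size and trigger mask are exactly the block-length
knob mentioned above.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define ROLL_WINDOW 4096            /* bytes covered by the rolling sum  */
#define FLUSH_MASK  0xfff           /* flush when low 12 bits are zero   */

/* Emits zlib format rather than a real gzip file, and feeds deflate()
 * one byte at a time purely for clarity. */
int rsyncable_deflate(FILE *in, FILE *out)
{
    z_stream zs;
    unsigned char inbuf[1], outbuf[16384], window[ROLL_WINDOW];
    unsigned long sum = 0;          /* simple additive rolling checksum  */
    size_t pos = 0, filled = 0;
    int c, flush;

    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    while ((c = fgetc(in)) != EOF) {
        /* update the rolling sum: drop the oldest byte, add the new one */
        if (filled == ROLL_WINDOW)
            sum -= window[pos];
        else
            filled++;
        window[pos] = (unsigned char)c;
        sum += window[pos];
        pos = (pos + 1) % ROLL_WINDOW;

        /* reset the dictionary whenever the checksum hits the trigger,
         * so identical input after a change compresses identically again */
        flush = (filled == ROLL_WINDOW && (sum & FLUSH_MASK) == 0)
                    ? Z_FULL_FLUSH : Z_NO_FLUSH;

        inbuf[0] = (unsigned char)c;
        zs.next_in  = inbuf;
        zs.avail_in = 1;
        do {
            zs.next_out  = outbuf;
            zs.avail_out = sizeof outbuf;
            deflate(&zs, flush);
            fwrite(outbuf, 1, sizeof outbuf - zs.avail_out, out);
        } while (zs.avail_out == 0);
    }

    /* drain whatever deflate still holds */
    zs.avail_in = 0;
    do {
        zs.next_out  = outbuf;
        zs.avail_out = sizeof outbuf;
        deflate(&zs, Z_FINISH);
        fwrite(outbuf, 1, sizeof outbuf - zs.avail_out, out);
    } while (zs.avail_out == 0);

    deflateEnd(&zs);
    return 0;
}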

   To summarize bandwidth reduction strategies:

1) bzip2
2) remove/reduce compression in the deb and use rsync
3) modify rsyncd/rsync to uncompress and recompress .debs
4) distribute xdelta diffs on the uncompressed control.tar and data.tar
5) local builder, incremental source update method

1) sacrifices small-memory and slow machines for a small size reduction
2) increases the archive size
3) loses md5sums and requires modified rsync software
4) increases archive size, production complexity, and mirroring complexity
5) is not as reliable and increases client requirements significantly

-Drake

