
A success story with apt and rsync



Hi,

From time to time the question arises on various forums whether it is
possible to use rsync efficiently with apt-get. Recently there was a thread
about it here on debian-devel, and it was also mentioned in the Debian
Weekly News of June 24th, 2003. However, I have only seen small parts of a
huge and complex problem set discussed in different places; I haven't found
an overview of the whole situation anywhere.

As one of the developers of the Hungarian distribution ``UHU-Linux'', I have
spent some time over the last few days collecting as much information as
possible, putting the patches together and writing a little code to fill
some minor gaps. Here I'd like to summarize and share my experiences.

Our distribution uses dpkg/apt for package management. We are not
Debian-based, though: even our build procedure, which produces the deb
packages, is completely different from Debian's, except for the last step
(which is obviously a ``dpkg-deb --build'').

Some of our resources are quite tight. This is especially true for the
bandwidth of the home machines of the developers and testers. Most of us
live behind a 384kbit/s ADSL line. From time to time we rebuild all our
packages to see if they still compile with our current packages. Such a full
rebuild produces 1.5GB of new packages, all with a new filename since the
release numbers are automatically bumped. Our goal was for an upgrade after
such a full rebuild to require only a reasonable amount of network traffic
instead of one and a half gigabytes. Before explaining how we achieved this,
I'd like to demonstrate the result.


One of my favorite games is quadra. The size of the package is nearly 3MB.
I've purged it from my system and then performed an ``apt-get install
quadra''. Apt-get printed this, amongst others:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.8 [2931kB]
Fetched 2931kB in 59s (49,0kB/s)

The download speed and time correspond to the 384kbit/s bandwidth.

I recompiled the package on the server. Then I typed ``apt-get update''
followed by ``apt-get install quadra'' again. This time apt-get printed
this:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.9 [2931kB]
Fetched 2931kB in 3s (788kB/s)

Yes, the download took only three seconds instead of one minute. Obviously
the two files differ in more than their filename: they contain their
release number, timestamps of files and perhaps other pieces of data that
make them different. Needless to say, a small change in the package would
only slightly increase the download time.
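
As a rough cross-check of the two transcripts above, the arithmetic can be
sketched in a few lines of Python. The assumptions are mine, not apt's: 1 kB
is taken as 8 kbit and the line is assumed to deliver its full nominal rate.

```python
# Figures copied from the two apt-get transcripts above.
PACKAGE_KB = 2931        # package size as reported by apt-get
LINE_KBIT_PER_S = 384    # nominal ADSL rate
FIRST_FETCH_S = 59       # download from scratch
SECOND_FETCH_S = 3       # download after the rebuild


def ideal_seconds(size_kb, kbit_per_s):
    """Transfer time on an ideal line, assuming 1 kB = 8 kbit."""
    return size_kb * 8 / kbit_per_s


def speedup(before_s, after_s):
    """How many times faster the second fetch was."""
    return before_s / after_s


print(round(ideal_seconds(PACKAGE_KB, LINE_KBIT_PER_S)))    # 61
print(round(speedup(FIRST_FETCH_S, SECOND_FETCH_S), 1))     # 19.7
```

Since apt reports whole seconds, the speedup figure is only a ballpark value.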

Speedup is usually approx. 2x--3x for packages containing lots of small
files, but can be extremely high for packages containing bigger files.


The rest of this mail describes the implementation details.


rsyncable gzip files
--------------------

A small change in a file causes its gzipped version to get out of sync with
the previous one, and hence rsync doesn't see any common parts in them.
There's a patch by Rusty Russell floating around on the net which adds an
--rsyncable option to gzip. It is already included in Debian. With this
option gzipped files have synchronization points, making rsync's job much
easier. The patch is available (amongst others) at [1a] and [1b].
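
To illustrate the idea of synchronization points, here is a minimal Python
sketch. It is not the algorithm of the patch: the window length, the modulus
and the helper names are all my own assumptions. It cuts the input wherever a
rolling sum of recent bytes hits a chosen value and resets the compressor
there, so the compressed bytes after a cut depend only on the data after it.

```python
import random
import zlib

WINDOW = 32      # hypothetical rolling-window length in bytes
MODULUS = 512    # hypothetical: cut when the window sum % MODULUS == 0


def chunk_ends(data):
    """Yield end offsets of chunks chosen from the content itself."""
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte that left the window
        if i + 1 >= WINDOW and rolling % MODULUS == 0:
            yield i + 1


def compress_rsyncable(data):
    """Raw deflate with a full flush (compressor reset) at every chunk end."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    out, start = [], 0
    for end in chunk_ends(data):
        out.append(comp.compress(data[start:end]))
        out.append(comp.flush(zlib.Z_FULL_FLUSH))
        start = end
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)


def common_suffix(a, b):
    """Number of identical trailing bytes of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def sample_text(n=100_000, seed=0):
    """Deterministic, compressible test data (made up for this sketch)."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"abcdefgh ") for _ in range(n))


if __name__ == "__main__":
    old = sample_text()
    new = b"X" + old    # one byte inserted at the very start
    print(common_suffix(compress_rsyncable(old), compress_rsyncable(new)))
```

Because the cut positions follow the content rather than fixed offsets, an
inserted byte only disturbs the output up to the next cut. The price is the
slightly worse compression discussed in the following paragraphs.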

The documentation in the original patch says ``This reduces compression by
about 1 percent most cases''. Debian's version says ``This increases size by
less than 1 percent most cases''. For us the size increase was 0.7% over all
our packages, but 1.2% for our most important packages (the core
distribution, about 300--400MB).

This 1% is very low if you think of it as 1%. If you think of it as losing
6MB on every CD, then, well, it could have been smaller. But if you think of
what you gain with it, then it is definitely worth it.

The same patch also exists for zlib (see [2a] or [2b]). However, while with
gzip you can control this behaviour with a command line option, doing so
with a library is not so trivial. The official patch disables rsyncable
support by default. You can enable it by changing "zlib_rsync = 0" to
"zlib_rsync = 1" within zlib's source, or you can control it from your
running application. As I didn't like these approaches, I added a small
patch so that setting the ZLIB_RSYNC environment variable turns on the
rsyncable support. This patch is at [3].

As dpkg seems to statically link against zlib, we had to recompile dpkg
after installing this patched zlib. After this we changed our build script
so that it invokes ``dpkg-deb --build'' with the ZLIB_RSYNC environment
variable set to some value.
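
For a build script written in Python, this step might look roughly like the
sketch below. The function names and the value "1" are my own choices, and the
variable only has an effect when dpkg is linked against a zlib carrying the
patch from [3].

```python
import os
import subprocess


def rsyncable_env(base=None):
    """Copy of the environment with ZLIB_RSYNC set ("1" is an arbitrary value)."""
    env = dict(os.environ if base is None else base)
    env["ZLIB_RSYNC"] = "1"
    return env


def build_command(package_root, output_deb):
    """Argument list for the final packaging step."""
    return ["dpkg-deb", "--build", package_root, output_deb]


def build_package(package_root, output_deb):
    """Run dpkg-deb with the variable set; raises CalledProcessError on failure."""
    subprocess.run(build_command(package_root, output_deb),
                   env=rsyncable_env(), check=True)
```

Passing a modified copy of the environment keeps the setting local to this one
dpkg-deb invocation instead of leaking it into the rest of the build.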


order of files
--------------

dpkg-deb puts the files in the .deb package in random order. I hate this
misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in
mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]) which
makes the files appear in the package in alphabetical order. I don't know
how efficient rsync is if you split a file into some dozens or even hundreds
of parts, shuffle them, and then synchronize the result with the original
version. Anyway, I'm sure that sorting the files cannot hurt rsync, it can
only help. My guess is that it really does help a lot.
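
The effect of a stable file order can be illustrated with a small Python
sketch that writes a tar archive with its members in sorted order. This is not
how sortdir works internally; it only shows the property we rely on, namely
that two builds of the same tree lay out their members identically. The
function names are made up, and only files are added, for brevity.

```python
import os
import tarfile


def sorted_members(root):
    """Yield file paths under root in a deterministic, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()    # in place, so os.walk descends in sorted order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def build_sorted_tar(root, out_path):
    """Write the files under root to a tar archive in sorted order."""
    with tarfile.open(out_path, "w") as tar:
        for path in sorted_members(root):
            tar.add(path, arcname=os.path.relpath(path, root), recursive=False)
```

Without the explicit sorting, the member order would follow whatever order the
filesystem happens to return directory entries in.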


similar filenames in rsync
--------------------------

Whenever we rebuild a package, it gets a different filename, as the release
number is increased. If a file has a different name, it is a completely
different file in rsync's eyes. There's a patch for rsync (yet again by
Rusty) which adds support for fuzzy filenames: when downloading a file, it
is merged with the local file that has the most similar filename. This patch
is available inside the official rsync 2.5.6 tarball or at [5]; however, it
only applies to rsync 2.5.4.
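
What "most similar filename" could mean is sketched below in Python. The
similarity measure (difflib's ratio) is an assumption chosen for illustration;
I am not claiming it is the measure the patch uses. The package filenames in
the usage example are hypothetical as well.

```python
import difflib


def most_similar(new_name, local_names):
    """Return the local filename most similar to new_name, or None if empty."""
    best, best_score = None, 0.0
    for name in local_names:
        score = difflib.SequenceMatcher(None, new_name, name).ratio()
        if score > best_score:
            best, best_score = name, score
    return best


# Hypothetical filenames: the old release is picked as the basis for the new one.
print(most_similar("quadra_1.1.8-2.9_i386.deb",
                   ["apt_0.5.5.1_i386.deb", "quadra_1.1.8-2.8_i386.deb"]))
```

A real implementation would probably also want a lower bound on the score, so
that a completely unrelated file is never used as the basis.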

Unfortunately I was unable to port this patch to 2.5.6 in a reasonable time
so we have an rsync 2.5.6 package without fuzzy support, and an rsync-fuzzy
2.5.4 package.


rsync method in apt
-------------------

Sviatoslav Sviridoff created a patch for apt which adds rsync support (rsync
needs to be patched, too). See it at [6]. It cleanly applies to apt 0.5.5.1.
Decoded versions of these base64 patches are also available at [7] (for
apt), [8a] (for rsync, ported to 2.5.4) or [8b] (for rsync, ported to 2.5.6)
and [9] (for rsync versions up to 2.5.5, it's already included in rsync
2.5.6).


the gap
-------

Sviatoslav's patch makes apt use rsync, but it has nothing to do with
similar filenames: it downloads the files from scratch. Hence it is useful
for replacing brain-damaged FTP with a sane protocol; however, it cannot
save network traffic on its own.

Apt asks its method helper binary (http, ftp, rsync...) to download the
files into a temporary directory (/var/cache/apt/archives/partial) and
later moves the files to their final place (/var/cache/apt/archives).
However, rsync --fuzzy only looks for similar filenames in the directory
where the new file is downloaded to. The solution would be to use the
--compare-dest option of rsync if it worked the way I expect it to work.
However, it works differently, see [10] for details.

To fill this gap I created a quick&ugly patch for rsync 2.5.4 which
introduces a --compare-fuzzy-dest option which does what we need for apt.
Get it from [11]. Furthermore, apt also needs a minor patch to call rsync
with the new options [12]. (This patch is ugly since it contains a
hard-coded path (/var/cache/apt/archives). It also renames the default
executable to rsync-fuzzy, which might not be what you want.)
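
The lookup that the new option has to perform can be sketched in Python as
follows. This shows the intended behaviour only, not the code of the patch:
the function name, the search order (download directory first) and the 0.6
similarity cutoff are all assumptions of mine.

```python
import difflib
import os


def find_basis(new_name, download_dir, compare_dir):
    """Return the path of the closest-named file, searching both directories."""
    for directory in (download_dir, compare_dir):
        try:
            names = os.listdir(directory)
        except FileNotFoundError:
            continue    # a missing directory simply offers no candidates
        match = difflib.get_close_matches(new_name, names, n=1, cutoff=0.6)
        if match:
            return os.path.join(directory, match[0])
    return None
```

With apt, download_dir would be /var/cache/apt/archives/partial and
compare_dir would be /var/cache/apt/archives, where the previous release of
the package usually still lies around.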


conclusion
----------

The good news is that it is working perfectly.

The bad news is that you can't hack it on your home computer as long as your
distribution doesn't provide rsync-friendly packages. Maybe one could set up
a public rsync server with high bandwidth that keeps syncing the official
packages and repacks them with rsync-friendly gzip/zlib and sorted files.



cheers,
Egmont

P.S. Please CC me if you reply, I'm not subscribed.



[1a] http://ozlabs.org/~rusty/gzip.rsync.patch2
[1b] https://svn.uhulinux.hu/packages/dev/gzip/patches/01-rsync.patch
[2a] http://moin.conectiva.com.br/files/CompressedRsync/attachments/zlib-1.1.4-rsync.patch
[2b] https://svn.uhulinux.hu/packages/dev/zlib/patches/02-rsync.patch
[3]  https://svn.uhulinux.hu/packages/dev/zlib/patches/03-rsync-from-env.patch
[4a] http://freshmeat.net/projects/sortdir/
[4b] ftp://ftp.uhulinux.hu/pub/sources/sortdir/sortdir-0.3.1.tar.gz
[5]  https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/02-fuzzy.patch
[6]  http://distro2.conectiva.com.br/pipermail/apt-rpm/2003-January/001085.html
[7]  https://svn.uhulinux.hu/packages/dev/apt/patches/03-rsync-method.patch
[8a] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/04-apt-support.patch
[8b] https://svn.uhulinux.hu/packages/dev/rsync/patches/02-apt-support.patch
[9]  https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/03-cleanup.patch
[10] http://lists.samba.org/pipermail/rsync/2003-July/011209.html
[11] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/05-compare-fuzzy-dest.patch
[12] https://svn.uhulinux.hu/packages/dev/apt/patches/04-rsync-method-fuzzy.patch

If you can't find a file under https://svn.uhulinux.hu/ then try to list
directories and take a look at other files. If the directory ``rsync-fuzzy''
doesn't exist then it means I've managed to port the fuzzy patch to 2.5.6
and hence look for them under the directory ``rsync''.


