Re: better RSYNC mirroring , for .debs and others
tom rothamel is working on a project called debdiff that works towards the
same goal. please read his announcement thread, which is archived at
http://www.debian.org/Lists-Archives/debian-devel-0002/msg00391.htm.
i like the idea of rsync modules, but the point your project misses is that
even a small addition or deletion near the beginning of a file ruins
rsync's speed bonus, because it then has to send everything. take a look at
tom's code. i think you'll find it interesting.
Andrea Mennucc1 (debian@Tonelli.sns.it) wrote:
>
> hi everybody
>
> I have implemented
> a good idea for reducing download stress for everybody who is
> mirroring a lot of data using rsync,
> like, the people who are mirroring Debian GNU/Linux:
> currently, many Debian "leaf mirrors" are using rsync
> for mirroring from the main .debian.org hosts.
>
> rsync contains a wonderful algorithm to speed up downloads when mirroring
> files which have only minor differences;
> the only problem is, this algorithm is ALMOST NEVER used
> when mirroring a Debian repository
> ... indeed, whenever a new version of a
> package enters the Debian repository,
> this package has a different name: for this reason rsync just does a
> full download.
> Summarizing, rsync currently gives some speedup only
> when it downloads Packages.gz files, or when it skips an already existing
> package.
>
> well, I have just implemented a simple
> way to use the algorithm even when downloading the .debs.
>
> here is a simple example
>
> suppose the current situation is
> $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
> whereas locally we have
> /debian/dist/bin/dpkg_1.deb
>
> when rsync looks for a local version of
> /debian/dist/bin/dpkg_2.deb
> if there is none, then rsync does
> ls -t /debian/dist/bin/dpkg_*
> and picks the most recent file it finds
>
> this way, rsync will use the file /debian/dist/bin/dpkg_1.deb
> to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
> (using its fabulous algorithm)
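
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from the patch: the function name is borrowed from the module calls listed later in the message, and the rule "everything before the first underscore is the package name" is an assumption made here for the example.

```python
import glob
import os
from typing import Optional


def find_alternative_version(target: str) -> Optional[str]:
    """Return a local file to use as the basis for downloading `target`.

    If `target` already exists it is returned unchanged. Otherwise the most
    recently modified file named <package>_* in the same directory is used,
    which mirrors the `ls -t dpkg_*` step in the example.
    """
    if os.path.exists(target):
        return target
    directory, name = os.path.split(target)
    # Assumption: the package name is everything before the first "_".
    package = name.split("_", 1)[0]
    pattern = os.path.join(glob.escape(directory), glob.escape(package) + "_*")
    candidates = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    if not candidates:
        return None  # nothing similar locally: fall back to a full download
    return max(candidates, key=os.path.getmtime)
```

With only /debian/dist/bin/dpkg_1.deb on disk, asking for /debian/dist/bin/dpkg_2.deb would return the dpkg_1.deb path, which can then be handed to rsync as the basis file.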
>
> BIG PRO: my new "rsync" is totally compatible with the old one
>
> Conclusion:
> this idea would make all debian mirror-people happier
> (especially if they mirror "unstable"; consider that, often,
> when a new version of a package is released, only small changes are made...
> sometimes, only the .postinst, or such, is really changed;
> this may, though, be masked by the compression, alas: but, see TODO)
>
> I attach two files: the first file is a diff, showing where, in
> the "rsync 2.4.1" source code tree, I have made some modifications;
> the second is a .tgz of all the new and modified files you
> need to build the new rsync:
> to build, first you need to download
> the source code (see rsync.samba.org/rsync/download.html)
> and then you unpack the file rsync.diffsrc.tgz in the source tree,
> and build.
>
> You may also get the compiled binary directly as
> ftp://tonelli.sns.it/pub/rsync/rsync
> and the new code all together in
> ftp://tonelli.sns.it/pub/rsync
>
> TODO:
> there are some potentially good ideas here:
>
> 1) the idea is to add "modules" to rsync:
> a "gzip" module, a "deb" module, an "rpm" module...;
> currently, modules just look for an older local version of the file;
>
> in a future version, each module would
> apply to a certain type of file, and create
> another file to pass to "rsync",
> so that this other file may lead to more speedup:
> e.g., the "gzip" module would unzip files before doing comparisons,
> and the "deb" module would unzip the data.tar.gz part of a package
>
> CONS: this would not be backward compatible, of course
>
> The idea is, a module may provide the following calls:
> find_alternative_version_MOD()
> receive_file_MOD()
> send_file_MOD()
>
> Currently, only find_alternative_version_deb() was implemented.
>
> If rsync uses only the find_alternative_version_MOD()
> calls, then it is "backward compatible" with the usual version:
> (in a sense, it is doing what the option --compare-dest already does,
> only in a smarter way)
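
One possible shape for such a module interface, as a rough Python sketch. The three method names follow the calls listed above; the class names, the suffix-based dispatch and the gzip behaviour are assumptions made for illustration, and the real patch is C code inside rsync.

```python
import gzip
from typing import List, Optional


class Module:
    """Default behaviour: no alternative file, data passed through as-is."""

    suffix = ""

    def find_alternative_version(self, target: str) -> Optional[str]:
        return None

    def send_file(self, data: bytes) -> bytes:
        return data

    def receive_file(self, data: bytes) -> bytes:
        return data


class GzipModule(Module):
    """Hypothetical "gzip" module: compare unpacked data, repack on arrival."""

    suffix = ".gz"

    def send_file(self, data: bytes) -> bytes:
        # Hand rsync the uncompressed stream so small changes stay local.
        return gzip.decompress(data)

    def receive_file(self, data: bytes) -> bytes:
        # Repacking gives the same contents, though not necessarily a
        # byte-identical .gz file.
        return gzip.compress(data)


def module_for(path: str, modules: List[Module]) -> Optional[Module]:
    """Pick the first module whose suffix matches `path`, if any."""
    for module in modules:
        if module.suffix and path.endswith(module.suffix):
            return module
    return None
```

A module that only overrides find_alternative_version() leaves the transferred data untouched, which is why that case stays compatible with an unmodified rsync on the other end.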
>
> I have not currently implemented any receive_file_MOD() or
> send_file_MOD(): these would need a change in the protocol;
> I hope that the rsync authors will give permission
>
> 1b) My idea (not sure) is that "rsync" may work if provided with "named pipes"
> instead of files: indeed, according to the technical report,
> it needs to read the local and remote files only once,
> and then, it writes the local file, without ever seeking backwards;
> then, the above modules would not need to actually
> use disk space and create temporary files.
>
>
> 2) for a faster apt-get downloading,
> it may be possible to do the same trick WHEN UPGRADING
> INSTALLED PACKAGES! Here is the idea:
> "apt-get creates a local version of the package
> (using dpkg-repack)
> and does the rsync to get the remote version"
>
>
>
> --
> Andrea C. Mennucci, Scuola Normale Superiore, Pisa, Italy
--
(jacob kuntz) jpk@cape.com jake@{megabite,underworld}.net
(megabite systems) "think free speech, not free beer."