Hi everybody,
I have implemented
an idea for reducing the download load for everybody who
mirrors a lot of data using rsync,
such as the people who mirror Debian GNU/Linux.
Currently, many Debian "leaf mirrors" use rsync
to mirror from the main .debian.org hosts.
rsync contains a wonderful algorithm to speed up downloads when mirroring
files which have only minor differences;
the only problem is that this algorithm is ALMOST NEVER used
when mirroring a Debian repository.
Indeed, whenever a new version of a
package enters the Debian repository,
the package file has a different name; for this reason rsync just does a
full download.
In summary, rsync currently gives a speedup only
when it downloads Packages.gz files, or when it skips an already existing
package.
Well, I have just implemented a simple
way to use the algorithm even when downloading the .debs.
Here is a simple example.
Suppose the current situation is
$REMOTE::/pub/debian/dist/bin/dpkg_2.deb
whereas locally we have
/debian/dist/bin/dpkg_1.deb
When rsync looks for a local version of
/debian/dist/bin/dpkg_2.deb
and there is none, then rsync does
ls -t /debian/dist/bin/dpkg_*
and picks the most recent file it finds.
This way, rsync will use the file /debian/dist/bin/dpkg_1.deb
to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
(using its fabulous algorithm).
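
To make the lookup above concrete, here is a minimal sketch in C of the
"most recent dpkg_* file" search. It is not the code from the attached
modules/deb.c: the names package_prefix_len(), newest_sibling_version() and
PATH_BUF are made up for this illustration, and the rule "the package name
is everything up to the first underscore" is an assumption.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

enum { PATH_BUF = 4096 };  /* illustrative buffer size (assumption) */

/* Hypothetical helper: length of the "name_" part of a basename,
 * e.g. "dpkg_2.deb" -> 5. Returns 0 if there is no '_'. */
size_t package_prefix_len(const char *base)
{
    const char *us = strchr(base, '_');
    return us ? (size_t)(us - base) + 1 : 0;
}

/* Hypothetical sketch: return a malloc'ed path to the most recently
 * modified regular file in the directory of fname whose name starts with
 * the same "name_" prefix, or NULL if none is found. Caller frees it. */
char *newest_sibling_version(const char *fname)
{
    char dir[PATH_BUF], path[PATH_BUF];
    const char *slash = strrchr(fname, '/');
    const char *base = slash ? slash + 1 : fname;
    size_t plen = package_prefix_len(base);
    size_t dlen = slash ? (size_t)(slash - fname) : 0;
    char *best = NULL;
    time_t best_mtime = 0;
    struct dirent *e;
    DIR *d;

    if (plen == 0 || dlen >= sizeof dir)
        return NULL;
    /* Directory part: "." for a bare name, "/" for a file in the root. */
    if (!slash) {
        strcpy(dir, ".");
    } else if (dlen == 0) {
        strcpy(dir, "/");
    } else {
        memcpy(dir, fname, dlen);
        dir[dlen] = '\0';
    }
    if ((d = opendir(dir)) == NULL)
        return NULL;
    while ((e = readdir(d)) != NULL) {
        struct stat st;
        int n;
        /* Keep only other files sharing the "name_" prefix. */
        if (strncmp(e->d_name, base, plen) != 0 || strcmp(e->d_name, base) == 0)
            continue;
        n = snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;
        if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
            continue;
        /* Remember the candidate with the newest modification time. */
        if (best == NULL || st.st_mtime > best_mtime) {
            char *copy = malloc((size_t)n + 1);
            if (copy == NULL)
                continue;
            memcpy(copy, path, (size_t)n + 1);
            free(best);
            best = copy;
            best_mtime = st.st_mtime;
        }
    }
    closedir(d);
    return best;
}
```

With only /debian/dist/bin/dpkg_1.deb present locally, calling
newest_sibling_version("/debian/dist/bin/dpkg_2.deb") in this sketch would
hand back the dpkg_1.deb path, which rsync could then use as the basis file.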
BIG PRO: my new "rsync" is totally compatible with the old one
Conclusion:
This idea would make all Debian mirror people happier
(especially if they mirror "unstable"; consider that, often,
when a new version of a package is released, only small changes are made...
sometimes only the .postinst, or similar, really changes;
this may, though, be masked by the compression, alas; but see the TODO).
I attach two files: the first file is a diff, showing where, in
the "rsync 2.4.1" source code tree, I have made some modifications;
the second is a .tgz of all the new and modified files you
need to build the new rsync.
To build, first you need to download
the source code (see rsync.samba.org/rsync/download.html),
then unpack the file rsync.diffsrc.tgz in the source tree,
and build.
You may also get the compiled binary directly from
ftp://tonelli.sns.it/pub/rsync/rsync
and all the new code together from
ftp://tonelli.sns.it/pub/rsync
TODO:
there are some potentially good ideas here:
1) The idea is to add "modules" to rsync:
a "gzip" module, a "deb" module, an "rpm" module...
Currently, modules just look for an older local version of the file;
in a future version, each module would
apply to a certain type of file, and create
another file to pass to "rsync",
so that this other file would probably lead to more speedup:
e.g., the "gzip" module would unzip files before doing comparisons,
and the "deb" module would unzip the data.tar.gz part of a package.
CONS: this would not be backward compatible, of course.
The idea is, a module may provide the following calls:
find_alternative_version_MOD()
receive_file_MOD()
send_file_MOD()
Currently, only find_alternative_version_deb() was implemented.
If rsync uses only the find_alternative_version_MOD()
calls, then it is "backward compatible" with the usual version
(in a sense, it is doing what the option --compare-dest already does,
only in a smarter way).
I have not yet implemented any receive_file_MOD() or
send_file_MOD(): these would need a change in the protocol,
and I hope that the rsync authors will give permission.
1b) My idea (not sure) is that "rsync" may work if provided with "named pipes"
instead of files: indeed, according to the technical report,
it needs to read the local and remote files only once,
and then, it writes the local file, without ever seeking backwards;
then, the above modules would not need to actually
use disk space and create temporary files.
2) for a faster apt-get downloading,
it may be possible to do the same trick WHEN UPGRADING
INSTALLED PACKAGES! Here is the idea:
"apt-get creates a local version of the package
(using dpkg-repack)
and then runs rsync to get the remote version"
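
As an illustration of the three module calls listed under TODO item 1, here
is one possible shape for a module table in C. This is only a sketch: the
struct layout, the argument lists of the three calls, the suffix-based
dispatch, and the names sync_module, example_modules, has_suffix() and
module_for() are all assumptions of mine, not what modules/modules.h in the
attached tarball contains.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module descriptor: one entry per file type. A NULL
 * function pointer means "this module does not provide that call". */
struct sync_module {
    const char *suffix;                                   /* e.g. ".deb" */
    char *(*find_alternative_version)(const char *fname); /* older local file */
    int (*receive_file)(const char *fname, int fd_in);    /* would need protocol change */
    int (*send_file)(const char *fname, int fd_out);      /* would need protocol change */
};

/* Example table with no calls filled in; a real "deb" entry would point
 * find_alternative_version at its own lookup function. */
const struct sync_module example_modules[] = {
    { ".deb", NULL, NULL, NULL },
    { ".gz",  NULL, NULL, NULL },
};

/* Return nonzero if s ends with suffix. */
int has_suffix(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Pick the first module whose suffix matches fname, or NULL if no module
 * applies (in which case rsync would behave exactly as it does today). */
const struct sync_module *module_for(const struct sync_module *table,
                                     size_t n, const char *fname)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (has_suffix(fname, table[i].suffix))
            return &table[i];
    return NULL;
}
```

In such a layout, a build that fills in only find_alternative_version and
leaves the other two pointers NULL would stay compatible with an unmodified
rsync on the other end, which matches the compatibility argument in item 1.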
--
Andrea C. Mennucci, Scuola Normale Superiore, Pisa, Italy
? modules
? zlib/dummy
Index: Makefile.in
===================================================================
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
< lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
> lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===================================================================
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
>
311c315,349
< fnamecmp = fnamecmpbuf;
---
> {
> fnamecmp = fnamecmpbuf;
> if (verbose > 1)
> rprintf(FINFO,"recv_generator opens %s\n",fnamecmp);
> }
> }
> #ifndef NODEBIANVERSIONER
> /* by A Mennucci. GPL
> this piece will look for a previous version
> of the same file
> I think that rsync is somewhat a "spaghetti code":
> look at how many extern declarations it uses....
> and it is crazy that this check has to be done in two separate places
> */
> if (statret == -1) {
> char *nf;
> int saveerrno = errno;
> nf=find_alternative_version(fname);
> if ( nf != NULL)
> {
> statret = link_stat(nf,&st);
> if (!S_ISREG(st.st_mode))
> statret = -1;
> if (statret == -1)
> {
> perror("stat of suggested older version failed:");
> errno = saveerrno;
> }
> else
> {
> fnamecmp = fnamecmpbuf;
> strcpy(fnamecmp, nf);
> }
> free (nf);
> }
312a351
> #endif
Index: receiver.c
===================================================================
RCS file: /cvsroot/rsync/receiver.c,v
retrieving revision 1.28
diff -r1.28 receiver.c
18a19,21
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
21a25
>
375a380,401
> #ifndef NODEBIANVERSIONER
> /* by A Mennucci.
> this piece will look for a previous version
> of the same file */
> if ((fd1 == -1)) {
> char *nf;
> nf=find_alternative_version(fname);
> if (nf!= NULL)
> {
> fnamecmp = fnamecmpbuf;
> strcpy(fnamecmpbuf,nf);
> fd1 = do_open(nf, O_RDONLY, 0);
> if(fd1==-1)
> perror("file candidate");
> free(nf);
> }
> }
> if (fd1 != -1 )
> rprintf(FINFO,
> "((candidate local oldfile for %s is %s))\n",
> fname,fnamecmp);
> #endif
Attachment:
rsync.diffsrc.tgz
Description: GNU Unix tar archive