
better RSYNC mirroring, for .debs and others

hi everybody

I have implemented
a good idea for reducing download stress for everybody who is
mirroring a lot of data using rsync,
such as the people who are mirroring Debian GNU/Linux:
currently, many Debian "leaf mirrors" use rsync
for mirroring from the main .debian.org hosts.

rsync contains a wonderful algorithm to speed up downloads when mirroring
files which have only minor differences;
the only problem is, this algorithm is ALMOST NEVER used
when mirroring a Debian repository:
indeed, whenever a new version of a
package enters the Debian repository,
the package file has a different name, so rsync finds no local file
to compare against and just does a full download.
Summarizing, rsync currently gets some speedup only
when it downloads Packages.gz files, or when it skips an already existing file.

well, I have just implemented a simple
way to use the algorithm even when downloading the .debs.

here is a simple example

suppose the current situation is
whereas locally we have

when rsync looks for a local version of
if there is none, then rsync does
  ls -t     /debian/dist/bin/dpkg_*
and looks for the most recent file it finds

this way, rsync will use the file /debian/dist/bin/dpkg_1.deb
to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
(using its fabulous algorithm)
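
The lookup described above can be sketched in C roughly as follows. This is
only an illustration under stated assumptions, not the code in the attached
patch: the function names version_glob_pattern() and newest_matching_file()
are hypothetical, and the sketch assumes that the package name ends at the
first '_' of the file name (as in dpkg_2.deb). Like "ls -t", it picks the
candidate with the most recent modification time, which is not necessarily
the highest version.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <glob.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: build "dir/name_*" from "dir/name_version.deb".
   Returns 0 on success, -1 if the basename has no '_' or out is too small. */
int version_glob_pattern(const char *path, char *out, size_t outlen)
{
    assert(path != NULL && out != NULL);
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;
    const char *us = strchr(base, '_');     /* assumed end of package name */
    if (us == NULL)
        return -1;
    size_t prefix = (size_t)(us - path) + 1; /* keep the '_' itself */
    if (prefix + 2 > outlen)                 /* room for '*' and NUL */
        return -1;
    memcpy(out, path, prefix);
    out[prefix] = '*';
    out[prefix + 1] = '\0';
    return 0;
}

/* Hypothetical helper: return a malloc'd copy of the most recently modified
   regular file matching pattern, or NULL if nothing matches. Caller frees. */
char *newest_matching_file(const char *pattern)
{
    glob_t g;
    char *best = NULL;
    time_t best_mtime = 0;

    memset(&g, 0, sizeof g);
    if (glob(pattern, 0, NULL, &g) != 0) {
        globfree(&g);
        return NULL;
    }
    for (size_t i = 0; i < g.gl_pathc; i++) {
        struct stat st;
        if (stat(g.gl_pathv[i], &st) != 0 || !S_ISREG(st.st_mode))
            continue;                        /* skip non-regular files */
        if (best == NULL || st.st_mtime > best_mtime) {
            free(best);
            best = strdup(g.gl_pathv[i]);
            best_mtime = st.st_mtime;
        }
    }
    globfree(&g);
    return best;
}
```

With these helpers, a caller that fails to find /debian/dist/bin/dpkg_2.deb
would build the pattern /debian/dist/bin/dpkg_* and hand the newest match
to rsync as the file to compare against.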

BIG PRO: my new "rsync" is totally compatible with the old one

this idea would make all Debian mirror people happier
(especially if they mirror "unstable"; consider that, often,
when a new version of a package is released, only small changes are made...
sometimes only the .postinst, or similar, is really changed;
this may, though, be masked by the compression, alas: but see TODO)

I attach two files: the first file is a diff, showing where, in
the "rsync 2.4.1" source code tree, I have made modifications;
the second is a .tgz of all the new and modified files you
need to build the new rsync:
to build, first you need to download
the source code (see rsync.samba.org/rsync/download.html),
then unpack the file rsync.diffsrc.tgz in the source tree,
and build.

You may also get the compiled binary directly as 
and the new code altogether in

there are some potentially good ideas here:

1) the idea is to add "modules" to rsync:
  a "gzip" module, a "deb" module, an "rpm" module...;
  currently, modules just look for an older local version of the file;

  in a future version, each module would
  apply to a certain type of file, and create
  another file to pass to "rsync",
  so that this other file would probably lead to more speedup:
  e.g., the "gzip" module would unzip files before doing comparisons,
  and the "deb" module would unzip the data.tar.gz part of a package

 CONS: this would not be backward compatible, of course
  The idea is, a module may provide  the following calls:
 Currently, only  find_alternative_version_deb() was implemented.

 If rsync uses only the find_alternative_version_MOD()
 calls, then it is "backward compatible" with the usual version
 (in a sense, it is doing what the option --compare-dest already does,
  only in a smarter way).
 I have not yet implemented any receive_file_MOD() or
 send_file_MOD(): these would need a change in the protocol;
 I hope that the rsync authors will give permission
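
 One possible shape for such a module table is sketched below. It is not
 taken from the attached modules/modules.h: the struct layout, the function
 signatures, the stub and the helper names module_for() and
 module_is_backward_compatible() are all assumptions made for illustration.
 Only the three call names (find_alternative_version, receive_file,
 send_file) come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file-type module. The hook names mirror the *_MOD calls
   named in the text; the signatures are assumptions. */
struct rsync_module {
    const char *suffix;                                   /* e.g. ".deb" */
    char *(*find_alternative_version)(const char *fname); /* local lookup only */
    int (*receive_file)(int fd_in, int fd_out);  /* would need a protocol change */
    int (*send_file)(int fd_in, int fd_out);     /* would need a protocol change */
};

/* Stub standing in for the real find_alternative_version_deb(). */
static char *find_alternative_version_deb_stub(const char *fname)
{
    (void)fname;
    return NULL;   /* a real module would return a malloc'd path */
}

static const struct rsync_module modules[] = {
    { ".deb", find_alternative_version_deb_stub, NULL, NULL },
};

/* Return the module whose suffix ends fname, or NULL if none applies. */
const struct rsync_module *module_for(const char *fname)
{
    assert(fname != NULL);
    size_t flen = strlen(fname);
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++) {
        size_t slen = strlen(modules[i].suffix);
        if (flen >= slen &&
            strcmp(fname + flen - slen, modules[i].suffix) == 0)
            return &modules[i];
    }
    return NULL;
}

/* A module that only chooses a local comparison file never changes what
   goes over the wire, so it stays compatible with a stock rsync peer. */
int module_is_backward_compatible(const struct rsync_module *m)
{
    return m->receive_file == NULL && m->send_file == NULL;
}
```

 In this sketch a module with only the first hook set corresponds to the
 current "deb" module, while filling in the other two hooks is what would
 break compatibility.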

1b) My idea (not sure) is that "rsync" may work if provided with "named pipes"
 instead of files: indeed, according to the technical report,
 it needs to read the local and remote files only once,
 and then it writes the local file without ever seeking backwards;
 if so, the above modules would not need to actually
 use disk space and create temporary files.

2) for faster apt-get downloading,
 it may be possible to do the same trick WHEN UPGRADING
 INSTALLED PACKAGES!  Here is the idea:
  "apt-get creates a local version of the package
  (using dpkg-repack)
  and does the rsync to get the remote version"

Andrea C. Mennucci,   Scuola Normale Superiore, Pisa, Italy
? modules
? zlib/dummy
Index: Makefile.in
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
< 	lib/fnmatch.h lib/getopt.h lib/mdfour.h
> 	lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
> MODULES_OBJ = modules/modules.o modules/deb.o
Index: generator.c
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
> #include "modules/modules.h"
> #endif
< 			fnamecmp = fnamecmpbuf;
> 		  {
> 		    fnamecmp = fnamecmpbuf;
> 		    if (verbose > 1)
> 		      rprintf(FINFO,"recv_generator  opens %s\n",fnamecmp);
> 		  }
> 	}
> 	/* by A Mennucci. GPL
> 	   this piece will look for a previous version 
> 	   of the same file
> 	I think that rsync is somewhat a "spaghetti code":
> 	look at how many extern declarations it uses....
> 	and it is crazy that this check has to be done in two separate places
> 	*/
> 	if (statret == -1) {
> 	  char *nf;
> 	  int saveerrno = errno;
> 	  nf=find_alternative_version(fname);
> 	  if ( nf != NULL)
> 	    {
> 	      statret = link_stat(nf,&st);
> 	      /* only inspect st when the stat succeeded */
> 	      if (statret == 0 && !S_ISREG(st.st_mode))
> 		statret = -1;
> 	      if (statret == -1)
> 		{
> 		  perror("stat of suggested older version failed:");
> 		  errno = saveerrno;
> 		}
> 	      else
> 		{
> 		  fnamecmp = fnamecmpbuf;
> 		  strcpy(fnamecmp, nf);
> 		}
> 	      free (nf);
> 	    }
> #endif
Index: receiver.c
RCS file: /cvsroot/rsync/receiver.c,v
retrieving revision 1.28
diff -r1.28 receiver.c
> #include "modules/modules.h"
> #endif
> 		/* by A Mennucci.
> 		   this piece will look for a previous version 
> 		   of the same file */
> 		if ((fd1 == -1)) {
> 		  char *nf;
> 		  nf=find_alternative_version(fname);
> 		  if (nf!= NULL)
> 		    {
> 		      fnamecmp = fnamecmpbuf;
> 		      strcpy(fnamecmpbuf,nf);
> 		      fd1 = do_open(nf, O_RDONLY, 0);
> 		      if(fd1==-1) 
> 			perror("file candidate");
> 		      free(nf);
> 		    }
> 		}
> 		if (fd1 != -1 )
> 		  rprintf(FINFO,
> 			  "((candidate local oldfile for %s is %s))\n",
> 			  fname,fnamecmp);
> #endif

Attachment: rsync.diffsrc.tgz
Description: GNU Unix tar archive
