
better RSYNC mirroring, for .debs and others



Hi everybody,

I have implemented
a good idea for reducing the download load for everybody who
mirrors a lot of data using rsync,
such as the people who mirror Debian GNU/Linux:
currently, many Debian "leaf mirrors" use rsync
to mirror from the main .debian.org hosts.

rsync contains a wonderful algorithm to speed up downloads when mirroring
files which have only minor differences;
the only problem is that this algorithm is ALMOST NEVER used
when mirroring a Debian repository.
Indeed, whenever a new version of a
package enters the Debian repository,
the package file has a different name: for this reason rsync just does a
full download.
In summary, rsync currently saves time only
when it downloads Packages.gz files, or when it skips an already existing
package.

Well, I have just implemented a simple
way to use the algorithm even when downloading the .debs.

Here is a simple example.

Suppose the current situation on the remote side is
    $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
whereas locally we have
    /debian/dist/bin/dpkg_1.deb

When rsync looks for a local version of
    /debian/dist/bin/dpkg_2.deb
and there is none, rsync does
  ls -t     /debian/dist/bin/dpkg_*
and picks the most recent file it finds.

This way, rsync will use the file     /debian/dist/bin/dpkg_1.deb
to try to speed up the download of    $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
(using its fabulous algorithm).
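
The lookup in the example above can be sketched in C roughly as follows. This is
only an illustration, not the code in the attached modules/deb.c: the names
same_package() and find_latest_sibling() are hypothetical, and it assumes that
"same package" means "same file name up to the first underscore" and that the
newest candidate is chosen by modification time, as "ls -t" would do.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Length of the "name_" prefix of a package file name (up to and
   including the first '_'), or 0 if the name contains no '_'. */
size_t stem_length(const char *base)
{
    const char *us = strchr(base, '_');
    return us ? (size_t)(us - base) + 1 : 0;
}

/* Nonzero if both file names start with the same "name_" prefix. */
int same_package(const char *a, const char *b)
{
    size_t n = stem_length(a);
    return n > 0 && n == stem_length(b) && strncmp(a, b, n) == 0;
}

/* Hypothetical helper: return a malloc'ed path to the most recently
   modified regular file, in the directory of `wanted`, that belongs to
   the same package; NULL if there is none. The caller frees the result. */
char *find_latest_sibling(const char *wanted)
{
    assert(wanted != NULL);
    const char *slash = strrchr(wanted, '/');
    const char *base = slash ? slash + 1 : wanted;

    /* Directory part: "." if the path has none, "/" for the root. */
    const char *dsrc = slash ? wanted : ".";
    size_t dlen = (!slash || slash == wanted) ? 1 : (size_t)(slash - wanted);
    char *dir = malloc(dlen + 1);
    if (dir == NULL)
        return NULL;
    memcpy(dir, dsrc, dlen);
    dir[dlen] = '\0';

    char *best = NULL;
    time_t best_mtime = 0;
    DIR *d = opendir(dir);
    if (d != NULL) {
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (!same_package(e->d_name, base))
                continue;
            size_t len = dlen + 1 + strlen(e->d_name) + 1;
            char *path = malloc(len);
            if (path == NULL)
                break;
            snprintf(path, len, "%s/%s", dir, e->d_name);
            struct stat st;
            if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
                (best == NULL || st.st_mtime > best_mtime)) {
                free(best);              /* keep only the newest candidate */
                best = path;
                best_mtime = st.st_mtime;
            } else {
                free(path);
            }
        }
        closedir(d);
    }
    free(dir);
    return best;
}
```

With the files of the example, find_latest_sibling("/debian/dist/bin/dpkg_2.deb")
would be expected to return "/debian/dist/bin/dpkg_1.deb", while a file such as
dpkg-dev_1.deb in the same directory would be ignored because its prefix differs.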

BIG PRO: my new "rsync" is totally compatible with the old one

Conclusion:
this idea would make all Debian mirror people happier
(especially if they mirror "unstable"; consider that, often,
when a new version of a package is released, only small changes are made...
sometimes only the .postinst, or similar, really changes;
this may, though, be masked by the compression, alas: but see TODO).

I attach two files: the first file is a diff, showing where, in
the "rsync 2.4.1" source code tree, I have made some modifications;
the second is a .tgz of all the new and modified files you
need to build the new rsync.
To build, first download
the source code (see rsync.samba.org/rsync/download.html),
then unpack the file rsync.diffsrc.tgz in the source tree,
and build.

You may also get the compiled binary directly at
 ftp://tonelli.sns.it/pub/rsync/rsync
and the new code altogether in
 ftp://tonelli.sns.it/pub/rsync

TODO:
there are some potentially good ideas here:

1) the idea is to add "modules" to rsync:
  a "gzip" module, a "deb" module, an "rpm" module...;
  currently, modules just look for an older local version of the file;

  in a future version, each module would
  apply to a certain type of file, and create
  another file to pass to "rsync",
  so that this other file would probably lead to more speedup:
  e.g., the "gzip" module would unzip files before doing comparisons,
  and the "deb" module would unzip the data.tar.gz part of a package
 CONS: this would not be backward compatible, of course
  
  The idea is, a module may provide  the following calls:
   find_alternative_version_MOD()
   receive_file_MOD()
   send_file_MOD()
   
 Currently, only  find_alternative_version_deb() was implemented.

 If rsync uses only the find_alternative_version_MOD()
 calls, then it is "backward compatible" with the usual version
 (in a sense, it is doing what the option --compare-dest already does,
  only in a smarter way).

 I have not yet implemented any receive_file_MOD() or
 send_file_MOD(): these would need a change in the protocol;
 I hope that the rsync authors will give permission

1b) My idea (I am not sure) is that "rsync" may work if given "named pipes"
 instead of files: indeed, according to the technical report,
 it needs to read the local and remote files only once,
 and then it writes the local file without ever seeking backwards;
 so the above modules would not need to actually
 use disk space and create temporary files.


2) for faster apt-get downloads,
 it may be possible to do the same trick WHEN UPGRADING
 INSTALLED PACKAGES!  Here is the idea:
  "apt-get creates a local version of the package
  (using dpkg-repack)
  and then runs rsync to get the remote version"
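
The module interface from item 1 could be laid out along these lines. This is a
sketch under assumptions, not the contents of modules/modules.h: the struct,
the dispatch by file suffix, and the signatures of receive_file and send_file
are all hypothetical (the post only names the three calls), and the "deb" hook
here is a stub that never finds a candidate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module descriptor; field names and signatures are
   assumptions made for illustration. */
struct module {
    const char *suffix;                                   /* e.g. ".deb" */
    char *(*find_alternative_version)(const char *fname); /* malloc'ed or NULL */
    int (*receive_file)(int fd_in, int fd_out);           /* NULL: not implemented */
    int (*send_file)(int fd_in, int fd_out);              /* NULL: not implemented */
};

/* Stub for the "deb" hook: a real one would search the local directory
   for an older version of the package. */
char *find_alternative_version_deb(const char *fname)
{
    (void)fname;
    return NULL;
}

/* Table of known modules; only the lookup hook is filled in. */
static const struct module modules[] = {
    { ".deb", find_alternative_version_deb, NULL, NULL },
};

/* Nonzero if `s` ends with `suffix`. */
int has_suffix(const char *s, const char *suffix)
{
    size_t n = strlen(s), m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Pick the module responsible for a file name, or NULL if none applies. */
const struct module *module_for(const char *fname)
{
    assert(fname != NULL);
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++)
        if (has_suffix(fname, modules[i].suffix))
            return &modules[i];
    return NULL;
}

/* Generic entry point: delegate to the matching module, if any. */
char *find_alternative_version(const char *fname)
{
    const struct module *m = module_for(fname);
    if (m == NULL || m->find_alternative_version == NULL)
        return NULL;
    return m->find_alternative_version(fname);
}

/* A module that only provides the lookup hook needs no protocol change. */
int is_backward_compatible(const struct module *m)
{
    return m->receive_file == NULL && m->send_file == NULL;
}
```

A table of function pointers keeps the lookup-only case (no receive_file or
send_file) distinguishable from modules that would need the protocol change.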
 


-- 
Andrea C. Mennucci,   Scuola Normale Superiore, Pisa, Italy
? modules
? zlib/dummy
Index: Makefile.in
===================================================================
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
< 	lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
> 	lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===================================================================
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
> 
311c315,349
< 			fnamecmp = fnamecmpbuf;
---
> 		  {
> 		    fnamecmp = fnamecmpbuf;
> 		    if (verbose > 1)
> 		      rprintf(FINFO,"recv_generator  opens %s\n",fnamecmp);
> 		  }
> 	}
> #ifndef NODEBIANVERSIONER
> 	/* by A Mennucci. GPL
> 	   this piece will look for a previous version 
> 	   of the same file
> 	I think that rsync is somewhat a "spaghetti code":
> 	look at how many extern declarations it uses....
> 	and it is crazy that this check has to be done in two separate places
> 	*/
> 	if (statret == -1) {
> 	  char *nf;
> 	  int saveerrno = errno;
> 	  nf=find_alternative_version(fname);
> 	  if ( nf != NULL)
> 	    {
> 	      statret = link_stat(nf,&st);
> 	      if (!S_ISREG(st.st_mode))
> 		statret = -1;
> 	      if (statret == -1)
> 		{
> 		  perror("stat of suggested older version failed:");
> 		  errno = saveerrno;
> 		}
> 	      else
> 		{
> 		  fnamecmp = fnamecmpbuf;
> 		  strcpy(fnamecmp, nf);
> 		}
> 	      free (nf);
> 	    }
312a351
> #endif
Index: receiver.c
===================================================================
RCS file: /cvsroot/rsync/receiver.c,v
retrieving revision 1.28
diff -r1.28 receiver.c
18a19,21
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
21a25
> 
375a380,401
> #ifndef NODEBIANVERSIONER
> 		/* by A Mennucci.
> 		   this piece will look for a previous version 
> 		   of the same file */
> 		if ((fd1 == -1)) {
> 		  char *nf;
> 		  nf=find_alternative_version(fname);
> 		  if (nf!= NULL)
> 		    {
> 		      fnamecmp = fnamecmpbuf;
> 		      strcpy(fnamecmpbuf,nf);
> 		      fd1 = do_open(nf, O_RDONLY, 0);
> 		      if(fd1==-1) 
> 			perror("file candidate");
> 		      free(nf);
> 		    }
> 		}
> 		if (fd1 != -1 )
> 		  rprintf(FINFO,
> 			  "((candidate local oldfile for %s is %s))\n",
> 			  fname,fnamecmp);
> #endif

Attachment: rsync.diffsrc.tgz
Description: GNU Unix tar archive

