[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: IDEA to SERIOUSLY reduce download times!

On Sat, Jul 10, 1999 at 06:02:20PM -0400, Fabien Ninoles was heard to say:
> >   Excellent thought.  This prompted me to look up something that I
> > remembered, and I was correct.  (whew! :) ) This is not (mostly) a problem,
> > thanks to xdelta.  From the xdelta man page...
> > 
> >      Patch
> >        The patch subcommand has the following synopsis:
> > 
> >        xdelta patch [ option...  ] patch [ from [ to ]]
> > 
> >        Applies patch  to  from.   If  from  was  omitted,  XDelta
> >        attempts to use the file with the original from file name.
> >        The from file must be identical to the one used to  create
> >        the delta.  Its MD5 checksum is used to verify this condi­
> >        tion.  The constructed file will be written to  to  unless
> >        to  is  named  "-"  or  the original to file name if to is
> >        omitted.
> > 
> >   It looks like xdelta does something like this already, checking before it
> > patches that the file is correct.
> > 
> >   The one thing I'm unsure about is how this interacts with gzipped files (I
> > plan to test this :) ) -- xdelta has 'intelligent' handling of gzipped files,
> > meaning that it performs a patch on the contents rather than on the file
> > itself.  This gives better patches I suppose (I haven't personally tested it
> > but it makes sense to me :) ) but could be a slight problem, in that it can
> > change the md5sum of the file (due to changes in the compression level).  xdelta
> > probably handles this correctly (doing the md5sum on the uncompressed data) but
> > it could confuse tripwire and schemes that compare md5sums against the
> > 'official' Debian archive.
> Forcing a GZIP=-9 environment variable should be ok since the policies
> ask this for all docs and other items /in extension/. bzip2 compression
> should be more problematic however. I think the best solution should be
> to make tripwire/schemes more intelligent about this things. I don't
> know how they work so, that just speculations.

  I'll look at GZIP=-9.  Of course, as you said, the 'right' thing is to fix
tripwire if it doesn't work this way already :)

> BTW, I think that having a deb-diff should not replace orig-deb packages.
> First, it's ridiculous to ask for a complete repacking of all packages for
> people who have a full internet connection. It's a waste of disk space.
> It's also increase download time when you install new packages: you'll
> have to download both the orig-deb and the deb-diff.

  Sorry if I misunderstood you.  If I understand what you're saying now, you
mean that for package foo, the following packages should be in the archive:
  foo_1.0.0-1.deb [orig-deb version]

   (do you want the reverse patch included as well?)

   The main reason I'm not so excited about this is that even though it makes
it easier for people to jump in anywhere, they may have to download extra
patchfiles or extra packages on an upgrade.  Eg, if foo_1.0.0-3 comes out,
someone who downloaded only foo_1.0.0-2.deb will now have to either download
the full .deb or acquire a patch back to the original version and then get the
patch to the new version, whereas someone who downloaded the original version
and a patch can just get the patch.  (it may very well be a good idea to keep
the .deb file for 1.0.0-3 in the archive, especially for cd-images, but I think
apt should download the original version and the patch -- or the new version
and the backwards patch if we include backpatches..but that bloats the
  Luckily, everything I've done so far can be used in either scheme :)

> Finally, about my comment that we can always get back to download the
> full packages:
> The patching will always be done on the download phase, before
> installation. That's important for the prerm and preinst scripts.
> If the patching fails, we can optionnaly decided to go further and
> download the full package version instead.

  Right.  This looks like another good reason to keep the full package
around.. (and maybe backpatches?)

> [Patches only in diff not in full pack.]
> Also, the Patches: field should only stay in the deb-diff. Why?
> It's for apt when it has to check if he can simply download a deb-diff
> instead of the deb file. My english isn't quite good, here an
> illustration:
> User select package foo-bar 1.1-4 which is available both has a diff
> (with Patches: foo-bar (=1.1-1) ) and has a deb (without the Patches
> field).
> Apt check if foo-bar (=1.1-1) is in status. No. However, the user
> has foo-bar 1.1-3 with a tag Patches: foo-bar (=1.1-1). So apt know
> that he can remove the patches from foo-bar 1.1-3 to have a 1.1-1
> and then patches it again. However, if the user install directly
> foo-bar 1.1-3 without having foo-bar 1.1-1 previously install, apt
> doesn't have choice to install the full foo-bar_1.1-4.deb.

  Ah, ok.  That's what I was thinking.  Do you think that this extra download
of a .deb file (when they can't get any patches to the next version) is
something we should be worried about?  In the worse case I suppose that it's
no worse than the current system :)

> That's why I want to versioning the Patches: field. Look at the
> way package version appears:
> First you have a new packages with, most of the time, some little
> patches applied to it in a short period of time:
> day1: foo-bar 1.0-1
> day2: foo-bar 1.0-2 Patches: foo-bar 1.0-1
> day5: foo-bar 1.0-3 Patches: foo-bar 1.0-1
> ...
> Then, a somewhat longer time pass where the debian packages seems
> stable enough. At this time, most users have time to upgrade their
> packages to foo-bar 1.0-3. Suddenly, a new upstream version appears,
> say 1.1. What should be the better choice for the package maintainer
> to do? First thoughts can lead to:
> foo-bar 1.1-1 Patches: foo-bar 1.0-1
> foo-bar 1.1-2 Patches: foo-bar 1.1-1
> ...
> However, more chance that a lot of people have already upgrade to
> 1.0-3 so, a more economic way should be to start with
> foo-bar 1.1-1 Patches: foo-bar 1.0-3

  I see.  I'm curious about how well this kind of thing (patching from one
upstream version to another) works in terms of reasonable patch sizes,
actually.  I should tell apt not to clean my archives for a while and let a
couple of versions of packages pile up to get more test material :)

> But what for next? Inexperience maintainers will tend to think,
> with an abuse of confidence that the best thing to do should be
> to patches against the new version 1.1-1. But that's is to
> forget that most people will still be at 1.0-3 and don't even
> have the time to upgrade between the short term first debian revision.
> So, I think a more pratical should be to suggest this:
> dayX   foo-bar 1.1-1 Patches: foo-bar 1.0-3
> dayX+1 foo-bar 1.1-2 Patches: foo-bar 1.0-3
> dayX+5 foo-bar 1.1-3 Patches: foo-bar 1.0-3
> ...
> dayX+50 foo-bar 1.1-4 Patches: foo-bar 1.1-3
> where foo-bar 1.1-4 is a late appearing security patche in the
> postrm script (for example).

  Interesting.  What would be kept in the archives then?  I'm still confused
on this point (although I'm getting the feeling you have a very good and clear
idea, you just can't get it through my head :) )  As I understand it, the FTP
site would have:

(day X+5)
foo-bar_1.0-3.deb                [big archive]
foo-bar_1.0-3:1.1-3.deb-diff     [potentially-big patch]
foo-bar_1.1-3.deb                [big archive]

  I think I'd prefer:
foo-bar_1.1-1.deb                [big archive]
foo-bar_1.1-1:1.1-3.deb-diff     [probably small patch, yes I know you think
                                  this is wrong..I think in some cases it may
				  be better, though]
foo-bar_1.1-3.deb                [big archive]

  But really until we test this on a reasonably large selection of packages
and versions I don't think we can say anything about what's best.

> I will try to make a resume of all this and what need to be done
> (proposal for policies, changes need by policies, apt, etc.).

  Thanks :)

> I think I will keep the idea of a separate directory/site cause is the most
> scalable one.

  Hmm, ok.  Like [along with binary-* and source] patches-i386?  Or

> Regards,
> Fabien.
> PS: I just run into your last mail (w Python). Looks great. Maybe we can
> try to go further and make a C version so we can have both a Perl and
> a Python one, and utimately incorporate it in libdpkg/libapt... :-)
> Have a good day.

  Hmm, maybe.  I'll have to see if I can think of a clean way to do it in C
first..I think it works OK this way though; dpkg doesn't really need to know
about it (it just installs the patched files) and apt [could] just call an
external program.  (there's precedent :), it does this for dpkg) It could also
use an internal routine, of course.  Hmm.  I'll have to think more about
whether the benefit is worth going through the agony of rewriting Python code
in C :)


  PS: I've attached a diff to debfile.py that does some general fixups
(including squishing a possible bug in conffile handling--I didn't properly
 replicate the shell script's behavior :) ) and, perhaps more importantly, adds
code to look for files which were just moved from one spot to another (or
unchanged!).  This changes the file format, so I also bumped the version number
stored in the deb-diffs.  I should probably make an upgrade program..all you
have to do is create an empty file named 'moves' in the diff and change the
magic-number file to read '2.0.diff-0.2'

  The Turtle Moves!
--- old/debfile.py	Sun Jul 11 14:06:35 1999
+++ debfile.py	Sun Jul 11 17:39:54 1999
@@ -19,6 +19,9 @@
 dpkgrepackre=re.compile('^dpkg-deb: building package `[^\']*\' in `./([^\']*)\'')
 dpkgdebre=re.compile('^dpkg-deb: building package `[^\']*\' in `([^\']*)\'')
+# This is used as a magic-number to indicate the format version of the diffiles
 def parsecontrol(file):
@@ -50,6 +53,40 @@
     return control
+def parsemoves(file):
+    moves=[]
+    line=file.readline()
+    while line <> '':
+	linelist=string.split(line)
+	if len(linelist) <= 3 or len(linelist)%2 == 0:
+	    raise BadArchiveException,'Wrong number of items in movesfile: %s'%line
+	group=linelist[:3]
+	group[1]=int(group[1])
+	# Change the size to an integer
+	linelist=linelist[3:]
+	dests=[]
+	while linelist <> []:
+	    dests.append(linelist[0],int(linelist[1]))
+	    linelist=linelist[2:]
+	moves.append(tuple(group)+(dests,))
+	line=file.readline()
+    if file.close():
+	raise BadArchiveException,'Couldn\'t read movesfile'
+    return moves
+def md5totext(md5sum):
+    retval=""
+    for i in md5sum:
+	retval=retval+hex[ord(i)%16]+hex[ord(i)/16]
+    return retval
 class debfile:
     def __init__(self,filename):
 	# FIXME: do this better :)
@@ -67,6 +104,19 @@
 		    del self.contents
 		raise BadArchiveException,'dpkg failed to extract the contents of %s'%filename
 	    return self.contents
+	elif attr == 'conffiles':
+	    if os.system('ar p %s control.tar.gz | tar ztf - | grep -q conffiles'%self.filename):
+		self.conffiles=[]
+	    else:
+		conffiles=os.popen('ar p %s control.tar.gz | tar zxOf - conffiles'%self.filename)
+		self.conffiles=string.split(conffiles.read())
+		if conffiles.close():
+		    if haskey(self.__dict__,'conffiles'):
+			del self.conffiles
+		    raise BadArchiveException,'Failed to read the conffiles list for %s'%self.filename
+	    return self.conffiles
 	elif attr == 'control':
 	    # Get information about the package:
 	    dpkgrun=os.popen('dpkg-deb -f %s'%self.filename)
@@ -158,7 +208,7 @@
     def __init__(self,filename):
 	contents=os.popen('ar t %s'%filename)
-	if contents.read() <> 'debian-diff\ncontrol.tar.gz\ndata.tar.gz\ndelta.tar.gz\n':
+	if contents.read() <> 'debian-diff\ncontrol.tar.gz\ndata.tar.gz\ndelta.tar.gz\nmoves\n':
 	    raise BadArchiveException,"%s doesn't look like a Debian binary patch"%filename
@@ -166,8 +216,8 @@
 	    raise BadArchiveException,"%s doesn't look like a Debian binary patch (ar complained)"%filename
 	version=os.popen('ar p %s debian-diff'%filename).readline()
-	if string.strip(version) <> '2.0.patch-0.1':
-	    raise BadArchiveException,"%s is the wrong patch format version (%s)"%(filename,version)
+	if string.strip(version) <> diffver:
+	    raise BadArchiveException,"%s is the wrong patch format version (expected %s, got %s)"%(filename,diffver,version)
@@ -218,6 +268,18 @@
+	    for (fromfile,size,md5sum,dests) in self.moves:
+		oldfile='%s/data/%s'%(olddir,fromfile)
+		if size <> os.path.getsize(oldfile):
+		    raise PatchException,'%s is the wrong size (expected %i, got %i)'%(fromfile,size,os.path.getsize(oldfile))
+		if md5sum <> md5totext(md5.new(open(oldfile).read()).digest()):
+		    raise PatchException,'md5sum mismatch for %s, has it been modified?'%fromfile
+		for (tofile,tomode) in dests:
+		    shutil.copy('%s/data/%s'%(olddir,fromfile),'%s/data/%s'%(newdir,tofile))
+		    os.chmod(oldfile,tomode)
 	    for i in self.deltas:
 		# Not efficient to work out the deltas twice? ..maybe
 		# extractdeltas should use tar's -v option and extract
@@ -261,6 +323,19 @@
 		raise BadControlFieldException,'Failed to read the control file of %s'%self.filename
 	    return self.control
+	elif attr == 'conffiles':
+	    if os.system('ar p %s control.tar.gz | tar ztf - | grep -q conffiles'):
+		self.conffiles=[]
+	    else:
+		conffiles=os.popen('ar p %s control.tar.gz | tar zxOf - conffiles'%self.filename)
+		self.conffiles=string.split(conffiles.read())
+		if conffiles.close():
+		    if haskey(self.__dict__,'conffiles'):
+			del self.conffiles
+		    raise BadArchiveException,'Failed to read the conffiles list for %s'%self.filename
+	    return self.conffiles
 	elif attr == 'contents':
 	    # Use the contents of data.tar.gz as the contents again
 	    contents=os.popen('ar p %s data.tar.gz | tar ztf -'%self.filename)
@@ -282,6 +357,14 @@
 		raise BadArchiveException,'Failed to read the delta list of %s'%self.filename
 	    return self.deltas
+	elif attr == 'moves':
+	    # Moves are returned in the format:
+	    # (fromfile,size,md5sum,dests)
+	    # where dests is a list of names.  Maybe I should start using a
+	    # class for this?
+	    moves=os.popen('ar p %s moves'%self.filename)
+	    self.moves=parsemoves(moves)
+	    return self.moves
 	elif attr == 'size':
@@ -307,6 +390,7 @@
     between fromfile and tofile."""
     import tempfile
     import shutil
+    import md5
@@ -322,31 +406,60 @@
-	# Clone directories (unforunately we have to do two walks here; this
-	# is the first)
-	#
-	#  Bleach!! *hack* *cough*  -- here's hoping that Guido puts closures
-	# into Python 1.6, it's the only thing this language lacks..
-	#  Or maybe this is just what I get for trying to write Scheme
-	# procedures in Python :)
-	class clonedir:
-	    def __init__(self,sub,todir):
-		for file in os.listdir('%s/data/%s'%(todir,sub)):
-		    newsub='%s/%s'%(sub,file)
-		    if os.path.isdir('%s/data/%s'%(todir,newsub)):
-			mode=os.stat('%s/data/%s'%(todir,newsub))[0]
-			os.mkdir('%s/shipping/%s'%(todir,newsub),mode)
-			os.mkdir('%s/delta/%s'%(todir,newsub),mode)
-			self.__init__(newsub,todir)
-	clonedir('',todir)
-	if os.path.exists('%s/control/conffiles'%todir):
-	    for i in string.split(open('%s/control/conffiles'%todir)):
-		os.link('%s/data/%s'%(todir,i),'%s/shipping/%s'%(todir,i))
+	# Now I clone directories *and* snarf md5sums of files (to find when
+	# files have moved)
+	tofiles=[]
+	fromfiles=[]
-	# We should find identical-but-moved packages RIGHT HERE!
-	# Remember that we need to know the md5sum..
+	for i in fromfile.contents:
+	    filename='%s/data/%s'%(fromdir,i)
+	    if os.path.isfile(filename):
+		fromfiles.append( (md5.new(open(filename).read()).digest(),i,os.path.getsize(filename)))
+	fromfiles.sort(lambda x,y:cmp(x[0],y[0]))
+	for i in tofile.contents:
+	    filename='%s/data/%s'%(todir,i)
+	    if os.path.isfile(filename):
+		tofiles.append( (md5.new(open(filename).read()).digest(),i,os.path.getsize(filename)))
+	    elif i <> './' and os.path.isdir(filename):
+		mode=os.stat(filename)[0]
+		# Populate directories (may not always be needed but it makes
+		# my life easier :) )
+		os.mkdir('%s/shipping/%s'%(todir,i))
+		os.mkdir('%s/delta/%s'%(todir,i))
+	tofiles.sort(lambda x,y:cmp(x[0],y[0]))
+	moves=[]
+	while tofiles <> [] and fromfiles <> []:
+	    (tomd5sum,tofilename,tosize)=tofiles.pop()
+	    (frommd5sum,fromfilename,fromsize)=fromfiles.pop()
+	    while frommd5sum > tomd5sum:
+		(frommd5sum,fromfilename,fromsize)=fromfiles.pop()
+	    while tomd5sum > frommd5sum:
+		(tomd5sum,tofilename,tosize)=tofiles.pop()
+	    moved=0
+	    foundlist=[]
+	    while frommd5sum == tomd5sum:
+		if fromsize == tosize:
+		    print 'Old %s is identical to new %s'%(fromfilename,tofilename)
+		    foundlist.append((tofilename,os.stat('%s/data/%s'%(todir,tofilename))[0]))
+		    os.remove('%s/data/%s'%(todir,tofilename))
+		    # We don't want to accidentally make a delta for it or include
+		    # it in the final patch..
+		    moved=1
+		(tomd5sum,tofilename,tosize)=tofiles.pop()
+	    if moved:
+		moves.append((fromfilename,fromsize,md5totext(frommd5sum),foundlist))
+		os.remove('%s/data/%s'%(fromdir,fromfilename))
+	for i in tofile.conffiles:
+	    os.link('%s/data/%s'%(todir,i),'%s/shipping/%s'%(todir,i))
+	    os.remove('%s/data/%s'%(todir,i))
 	# Now make deltas for files which are in the same place and unmoved, and
 	# move new files to shipping
@@ -395,7 +508,7 @@
 	print "--stamp"
 	# Create the new archive.  First, create the stamp:
-	open('%s/debian-diff'%todir,'w').write('2.0.patch-0.1\n')
+	open('%s/debian-diff'%todir,'w').write('%s\n'%diffver)
 	# Now munge the control files to indicate patchiness [also save this
 	# information for later]
@@ -425,6 +538,19 @@
 	if os.system("cd %s/delta && tar czf ../delta.tar.gz *"%todir):
 	    raise ArchiveCreationException,"Couldn't create delta.tar.gz"
+	print "--moves"
+	movefile=open('%s/moves'%todir,'w')
+	for i in moves:
+	    dests=i[-1]
+	    movefile.write('%s %s %s'%i[:-1])
+	    for file in dests:
+		print file
+		movefile.write(' %s %i'%file)
+		# Note that we include the permissions in the data written..
+	    movefile.write('\n')
+	movefile.close()
 	if fromfile.has_key('Architecture'):
@@ -444,7 +570,7 @@
-	if os.system('ar rc %s %s/debian-diff %s/control.tar.gz %s/data.tar.gz %s/delta.tar.gz'%(outfile,todir,todir,todir,todir)):
+	if os.system('ar rc %s %s/debian-diff %s/control.tar.gz %s/data.tar.gz %s/delta.tar.gz %s/moves'%(outfile,todir,todir,todir,todir,todir)):
 	    raise ArchiveCreationException,'Couldn\'t create %s'%outfile
 	# Clean up

Reply to: