
Re: IDEA to SERIOUSLY reduce download times!



On Wed, Jul 07, 1999 at 01:39:38AM -0400, Steve Dunham was heard to say:
> > Some variations are rsyncing the files in a gzip .deb using the local
> > files and other stuff like that - all very doable, but difficult.
>
> At one point in time, as an experiment, I wrote a perl script that
> extracted two rpm files, generated lists of new, modified, and deleted
> files, stored the list of deleted files, along with a tar.gz of the
> new files and a tar.gz of xdelta's of the modified files.  (I think I
> also tried detecting moved files via md5sums.)  I stuffed the
> resulting pieces into a .ar archive and compared the sizes to the new
> packages.  (My sample was the then current updates for RedHat - I
> can't remember the exact version.)  I ended up with a net savings of
> about 50%.  Not too shabby, but to me it didn't justify moving away
> from the simpler status-quo of just distributing new packages.

  Out of curiosity, I decided to fiddle with this a little.  I made a
short shell script that creates a file with, I believe, [almost] all the
information necessary to move from one version of a package to another.  I
don't have enough packages lying around to test much :-), but in the one case
I have on hand (where I only changed/added a few text configuration files) I
got something similar to your results for ncurses.  Unlike you, though, I
think that it is worth the trouble.

> According to google, my original post is in the April 1998 archives of
> rpm-list, but the links are dead because they are reorganizing
> archive.redhat.com.
> 
> ok, a power search of "dunham xdelta rpm" on dejanews does the trick.
> 
> :  unpacks the cpio part of two rpm archives
> :  constructs a file list with md5sum's in perl hashes
> :  identifies deleted, renamed, new and changed files
> :
> :  generates an xdelta of each changed file
> :  generates a list of delete and rename operations (gzipped)
> :  builds a cpio.gz of the new files
> :  builds a cpio of the xdelta files (xdelta files are already compressed)
> :
> :  makes an "ar" archive of the results of the previous three steps
> 
> : Package        Orig Ver   O.Size  New Ver   N.Size  Diff Size
> : =============  =========  ======  ========  ======  =========
> : ncurses        1.9.9e-6     524k  1.9.9e-8    524k        47k
> : ncurses-devel  1.9.9e-6     382k  1.9.9e-8    382k         4k
> : perl           5.004-1     3128k  5.004-4    3260k      1186k
> : util-linux     2.7-11       297k  2.7-15      344k        57k
> : pine           3.96-3       895k  3.96-7      897k       281k
> : mh             6.8.4-4     1152k  6.8.4-4    1156k       489k
> : glibc          2.0.5c-10   2559k  2.0.7-6    3855k      2149k
> : =============================================================
> : Total                      8937k            10418k      4213k
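
  The "identifies deleted, renamed, new and changed files" step in the quoted
list could look roughly like the sketch below.  It is not the perl script
described above, only a shell/awk stand-in.  It assumes GNU md5sum, paths
without whitespace, and two already-unpacked trees; the function names
manifest and classify are made up for illustration.

```shell
#!/bin/sh
# Sketch only: classify files between two unpacked package trees by md5sum.
# Assumes GNU md5sum and paths without whitespace.

# manifest DIR: print "<md5> <relative path>" for each regular file in DIR.
manifest() {
  (cd "$1" && find . -type f | sort | while IFS= read -r f; do
    printf '%s %s\n' "$(md5sum < "$f" | cut -d' ' -f1)" "${f#./}"
  done)
}

# classify OLD_DIR NEW_DIR: print "changed", "deleted", "new" and
# "renamed" lines, sorted.
classify() {
  old_list=$(mktemp) && new_list=$(mktemp) || return 1
  manifest "$1" > "$old_list"
  manifest "$2" > "$new_list"
  awk '
    FILENAME == ARGV[1] { prev[$2] = $1; next }   # path -> checksum, old tree
                        { cur[$2] = $1 }          # path -> checksum, new tree
    END {
      # Old paths that are missing from the new tree, indexed by checksum.
      for (p in prev) if (!(p in cur)) gone[prev[p]] = p
      for (p in cur) {
        if (p in prev) {
          if (prev[p] != cur[p]) print "changed", p
        } else if (cur[p] in gone) {
          # Same content vanished from one path and appeared at another.
          print "renamed", gone[cur[p]], p
          moved[gone[cur[p]]] = 1
        } else {
          print "new", p
        }
      }
      for (p in prev) if (!(p in cur) && !(p in moved)) print "deleted", p
    }
  ' "$old_list" "$new_list" | sort
  rm -f "$old_list" "$new_list"
}
```

  Called as "classify old-tree new-tree", it prints one line per affected
file and nothing for unchanged files.  Because renames are paired purely by
checksum, files with identical contents (empty files, for instance) can be
paired up wrongly.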

  I download things over a 28.8k line.  Cutting downloads by as much as most
of those patches allow would be incredibly helpful.  10MB is a huge download;
4MB is halfway manageable (~20 minutes, I believe).
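
  As a back-of-the-envelope check of those times, here is a small sketch that
turns the two totals from the table into idealized transfer times.  It
assumes that "k" means 1024 bytes and that the full 28800 bit/s is available,
with no protocol overhead and no modem compression, so real downloads will
differ.  The function name is made up for illustration.

```shell
#!/bin/sh
# Idealized transfer time over a 28.8k line.
# Assumptions: "k" is 1024 bytes; the whole 28800 bit/s carries payload.

# seconds_at_28k8 SIZE_IN_K: print the whole seconds needed for SIZE_IN_K k.
seconds_at_28k8() {
  echo $(( $1 * 1024 * 8 / 28800 ))
}

# Totals from the table: all new packages versus all diffs.
for size in 10418 4213; do
  secs=$(seconds_at_28k8 "$size")
  # Add 30 before dividing so the minutes are rounded, not truncated.
  echo "${size}k: ${secs}s (about $(( (secs + 30) / 60 )) min)"
done
```

  Under those assumptions this prints 2963s (about 49 min) for the full
packages and 1198s (about 20 min) for the diffs.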

  Here's how my script works (pretty much the same as yours I think):

  -> unpack the data.tar.gz member of both .debs
  -> unpack the control section of the new .deb
  -> create two temporary staging areas called 'delta' and 'shipping'
  -> move all conffiles for the new .deb into 'shipping'
  -> for all regular files in the new version that also exist as regular files
    in the old version, perform an xdelta and save the result as a file with
    the same name relative to 'delta'.  That is, a delta for usr/doc/README
    will be stored in 'delta'/usr/doc/README .  If the delta is larger than the
    new file, delete it.
     Files which do not have a delta generated (either because they are links,
    because the delta was larger than the file itself, or because there's no
    corresponding file in the original package) are moved to 'shipping'.
  -> data.tar.gz is overwritten with the contents of 'shipping'
  -> delta.tar.gz is created with the contents of 'delta'
  -> An ar archive is created which contains debian-binary, control.tar.gz,
    data.tar.gz, and delta.tar.gz.  [perhaps I should modify debian-binary in
    the new archive]
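
  The decision of which files go to 'shipping' and which are candidates for a
delta can be sketched as a dry run that needs no xdelta at all.  This is only
an illustration of the rules listed above, not the attached script: the
function name plan_patch and its output format are invented, and the size
check (delta larger than the new file) is left out because it needs the real
delta.  It assumes the conffiles list holds one absolute path per line.

```shell
#!/bin/sh
# Dry-run sketch: decide per file whether it would be shipped whole or
# would be a candidate for a delta.

# plan_patch OLD_DIR NEW_DIR [CONFFILES]
plan_patch() {
  old=$1 new=$2 conffiles=${3:-/dev/null}
  (cd "$new" && find . ! -type d | sort) | while IFS= read -r f; do
    path=${f#./}
    if grep -q -x -F "/$path" "$conffiles"; then
      # conffiles are left to dpkg, so they are always shipped whole
      echo "ship $path (conffile)"
    elif [ -L "$new/$path" ] || [ ! -f "$new/$path" ]; then
      echo "ship $path (not a regular file)"
    elif [ -f "$old/$path" ] && [ ! -L "$old/$path" ]; then
      # regular file in both versions: worth trying a delta
      echo "delta $path"
    else
      echo "ship $path (no regular file in old version)"
    fi
  done
}
```

  Running it on the two unpacked data directories before doing the real work
gives a quick idea of how much of a package can be delta'd at all.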

  This is the easy bit :-)  On the generating side, all that's still needed is
clean handling of versioning.

  I still cannot find a clean way to actually apply the patches.  Ideally, it
would be quite simple: you would execute dpkg --install on the new file.  In
the 'unpack' phase of dpkg, dpkg would unpack data.tar.gz as usual, but then do
an 'xdelta patch' for all contents of delta.tar.gz, creating backups of
originals as with data.tar.gz [I don't actually know what mechanism is used
normally for this; are the old files renamed or do the new ones get .dpkg-new
appended, or is something else done?].  This way, if something goes horribly
wrong in the patching you can complain about an error and restore things to the
way they were.  There would have to be a way to indicate patching information
elsewhere, of course.  Perhaps a Patches: control item could be added; I don't
know what would be done about Packages.gz and apt, or whether distributing
patches on the FTP mirrors is a good use of space.
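
  The "restore things to the way they were" part can be sketched outside of
dpkg as a two-phase apply: build every patched file next to its original
first, and move the results into place only if all of them succeeded.  This
is a variation on the backup idea above, not a description of what dpkg
does.  The name patch_tree and the .patch-new suffix are invented for this
sketch, and the command that does the patching (an 'xdelta patch' call, for
example) is passed in by the caller.

```shell
#!/bin/sh
# Two-phase apply sketch: originals are untouched unless every patch works.

# patch_tree ROOT LIST CMD [ARGS...]
# LIST holds one path per line, relative to ROOT.  For each path, CMD is
# run as:  CMD ARGS... <path> <existing file> <file to create>
patch_tree() {
  root=$1 list=$2
  shift 2
  # Phase 1: create every new file beside the original.
  while IFS= read -r p; do
    if ! "$@" "$p" "$root/$p" "$root/$p.patch-new" < /dev/null; then
      echo "patching $p failed, rolling back" >&2
      while IFS= read -r q; do
        rm -f "$root/$q.patch-new"
      done < "$list"
      return 1
    fi
  done < "$list"
  # Phase 2: everything succeeded, so move the results into place.
  while IFS= read -r p; do
    mv "$root/$p.patch-new" "$root/$p" || return 1
  done < "$list"
}
```

  A failure in phase 1 leaves the installed files exactly as they were.  A
failure during the renames in phase 2 is not covered; handling that would
need real backups of the originals.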

  Of course, that's a pipe dream :-)  I've also considered hackery using
preinst scripts to do the patch [and therefore having to include delta.tar.gz
somewhere inside data.tar.gz] but this would get nasty -- in order to have
dpkg's file list come out correctly, data.tar.gz would have to contain entries
for all files that were really in the package.  Another option (probably the
best for now) is to use dpkg-repack: create a temporary 'old' package, extract
it and the new data.tar.gz to a temporary directory, apply the patches and
copy the patched files to where the new data.tar.gz was extracted, recreate
data.tar.gz with the patched files, and create the new deb from debian-binary,
control.tar.gz, and the rebuilt data.tar.gz .
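
  The last step of the dpkg-repack route, putting the pieces back together,
might look like the sketch below.  It covers only the reassembly, not the
dpkg-repack or patching steps.  The directory layout (a work directory
holding debian-binary, control.tar.gz and a patched tree named "data") and
the function name rebuild_deb are assumptions made for the example.

```shell
#!/bin/sh
# Reassembly sketch: rebuild data.tar.gz from a patched tree and create
# the .deb from its three members.

# rebuild_deb WORKDIR OUT.deb
# WORKDIR must contain debian-binary, control.tar.gz and a directory "data".
# OUT.deb must be an absolute path, because ar is run from inside WORKDIR.
rebuild_deb() {
  work=$1 out=$2
  # Repack the patched tree; "." keeps the ./-prefixed member names.
  tar -czf "$work/data.tar.gz" -C "$work/data" . || return 1
  # Start from scratch so stale members from an earlier run cannot survive.
  rm -f "$out"
  # debian-binary has to be the first member of the archive.
  (cd "$work" && ar rc "$out" debian-binary control.tar.gz data.tar.gz)
}
```

  The result can then be handed to dpkg --install like any other package.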

  Anyway, no more time to think about this at the moment :-)

  Daniel

-- 
  "I've struggled with reality for thirty-five years, but I'm glad to say that
   I finally won."
     -- _Harvey_
#!/bin/sh

#  Makes a delta between two Debian packages
#
#  Call as makepatch [fromfile] [tofile]
#
#  Note that this preserves the new DEBIAN/ directory [control.tar.gz]

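abort()
{
  # Remove the scratch directory and exit with a failure status
  rm -rf "$TMPDIR"
  # Exit statuses must be 0-255 in plain sh, so map the callers' -1 to 1
  exit "${1#-}"
}

if [ "$#" != 2 ]
then
  echo "Error: $0 must be called with two arguments" 1>&2
  exit 1
fi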

TMPDIR=/tmp/makepatch_$$

# Plain exit here: abort would delete a directory that we did not create
mkdir $TMPDIR || exit 1

FROMDIR=$TMPDIR/from
TODIR=$TMPDIR/to

# Extract the packages:
# <dir>/data is the original data.tar.gz
# <dir>/delta is where we store the hierarchy of deltas
# <dir>/shipping is what's left of the data.tar.gz :-)
# <dir>/control is for the control files (we don't modify this but I need to
#              see the new conffile list)
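mkdir $FROMDIR $TODIR &&
mkdir $FROMDIR/data $TODIR/data $TODIR/shipping $TODIR/control $TODIR/delta || abort -1

# ar runs from inside the scratch directories, so relative package paths
# have to be made absolute first
case $1 in /*) FROMDEB=$1 ;; *) FROMDEB=$PWD/$1 ;; esac
case $2 in /*) TODEB=$2 ;; *) TODEB=$PWD/$2 ;; esac

(cd $FROMDIR && ar x "$FROMDEB") &&
(cd $TODIR   && ar x "$TODEB") || abort -1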

# Stripping a trailing .deb changes nothing if the name has no such suffix
if [ "${1%.deb}" = "$1" ]
then
  echo "Warning!  $1 does not end in .deb, filename guessing will be confused!" 1>&2
fi

if [ "${2%.deb}" = "$2" ]
then
  echo "Warning!  $2 does not end in .deb, filename guessing will be confused!" 1>&2
fi

FROMBASE=`basename "$1" .deb`
TOBASE=`basename "$2" .deb`
FINALNAME="$FROMBASE:$TOBASE.deb-diff"

# Quote the substitution so a missing or empty debian-binary gives a
# warning instead of a syntax error
if [ "`cat $FROMDIR/debian-binary`" != "2.0" ]
then
  echo "Warning!  $1 is not in Debian-binary-2.0 format.  Bad Things[tm] may happen!" 1>&2
fi

if [ "`cat $TODIR/debian-binary`" != "2.0" ]
then
  echo "Warning!  $2 is not in Debian-binary-2.0 format.  Bad Things[tm] may happen!" 1>&2
fi

echo "Extracting archives..."

(cd $TODIR/control && tar zxf ../control.tar.gz) &&
(cd $TODIR/data && tar zxf ../data.tar.gz) &&
(cd $FROMDIR/data && tar zxf ../data.tar.gz) || abort -1

CONFFILES=

if [ -f $TODIR/control/conffiles ]
then
  CONFFILES=`cat $TODIR/control/conffiles`
fi

echo "Populating directories.."

for dir in `find $TODIR/data -type d -printf '%P\n'`
# Reproduce the directory structure (find lists parents before children)
do
  echo $dir
  mkdir $TODIR/delta/$dir $TODIR/shipping/$dir || abort -1
done

echo "Moving conffiles.."

for file in $CONFFILES
# Don't do deltas on the conffiles, let dpkg handle them!
do
  echo $file
  mv $TODIR/data/$file $TODIR/shipping/$file || abort -1
done

echo "Calculating deltas.."

for newfile in `find $TODIR/data -type f -printf '%P\n'`
do
  # Keep the delta only if it is smaller than the new file it stands for
  if [ -f $FROMDIR/data/$newfile ] && (xdelta delta $FROMDIR/data/$newfile $TODIR/data/$newfile $TODIR/delta/$newfile ; [ `find $TODIR/data/$newfile -printf %s` -gt `find $TODIR/delta/$newfile -printf %s` ] )
  then
    echo "Delta created for $newfile"
    rm -f $TODIR/data/$newfile || abort -1
  else
    echo "No delta created for $newfile"
    rm -f $TODIR/delta/$newfile &&
    mv $TODIR/data/$newfile $TODIR/shipping/$newfile || abort -1
  fi
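done

# Links and other non-regular files never get a delta; ship them as they are
for other in `find $TODIR/data ! -type d ! -type f -printf '%P\n'`
do
  mv $TODIR/data/$other $TODIR/shipping/$other || abort -1
done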

# Create the patched archive

echo "Creating archive:"
echo "data..."
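# Archive "." rather than "*" so dotfiles are included and an empty
# directory does not make tar fail
(cd $TODIR/shipping && tar czf ../data.tar.gz .) || abort -1
echo "delta..."
(cd $TODIR/delta && tar czf ../delta.tar.gz .) || abort -1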
# Leave the control section alone

(cd $TODIR && ar r "/tmp/$FINALNAME" debian-binary control.tar.gz data.tar.gz delta.tar.gz) || abort -1

echo "Done, patchfile is in /tmp/$FINALNAME"

# Clean up
rm -rf $TMPDIR
