[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Better pdiff handling for apt



Salut tout le monde,

Some time ago (*cough* 2009), I had a play with working out how to
apply pdiffs more efficiently than apt currently does, and implemented
a proof of concept in python [0]. There weren't any replies (even a
"ooo, cool") when I posted to the deity list, so I left it at that;
though trolling google now, I see that it got mentioned on -devel [1]
not all that long ago (*cough* 2012), so apparently it did get read
after all. Not that it looks like it would've done much good, because
it seems like the script I put on the web, and the one I was actually
using are completely different to the extent that the one on the web
even has syntax errors. WTF?

Anyhoo... Like many, I've been getting annoyed at how tedious pdiff
application is over the past year, so over Christmas I thought I'd
have a go at patching apt to do things "properly". I've got it to the
point where it works now, and apt-get update reports things like:

   Fetched 199 MB in 50s (3919 kB/s)

over a 50kB/s link when updating testing+unstable for i386+amd64 for a
day's worth of changes, rather than reporting 7000 B/s over a megabit
link. It should also be much more pleasant on SSDs, since it only
attempts to write each diff twice (once compressed, once
uncompressed), and each packages file once; rather than writing each
diff once, and the Packages file once for each diff.

My patched apt source is up on github at:

   https://github.com/ajtowns/apt/tree/better-pdiffs

My current theory is that it's best to re-specify the .diff/Index
format by adding an additional header

   Patch-Step-Size: 1 1 4 3 2 1 1

which would indicate that there are six patches available, and that if
you start from the oldest known version, you should apply patch 1,
then patch 2 (patch 1 + 1), then patch 3 (patch 2 + 1), then patch 7
(patch 3 + 4), and then you're done (patch 7 + 1 == patch 8, which
doesn't exist). If you wanted to run an archive where you only had to
download one patch and apply it no matter how old your Packages file
was, the line would look like:

   Patch-Step-Size: 12 11 10 9 8 7 6 5 4 3 2 1

(I think a nice efficient compromise method would be Patch-Step-Size:
8 8 4 4 4 4 2 2 1, which would only update a lg(diffs) each time you
updated the packages file, and only require users to download and
apply a maximum of lg(diffs) to get up to date)

If the line's not present, it should be treated as:

   Patch-Step-Size: 1 1 1 1 1 1 1 ...

which would be compatible with the way Debian does things, but
presumably not with some derivatives.

Anyway, the code works adequately as a proof of concept and I don't
think it should need a major re-write, but the things I think still
need doing to it are:

 * make sure main sha256/sha512/etc hashes are being checked after
patches are applied
 * verify that diff sha1 hashes are being checked on download
 * don't keep diffs around after they've been applied
 * add support for Patch-Step-Size as above
 * more robust parsing of diff files in rred method
 * review code paths in acquire-item to make sure they're sane
 * fix up debug message for rred method and diff acquire-item classes
 * consider just reading the gzipped diffs directly rather than
uncompressing them
 * check that the http downloads are being done as efficiently as possible (?)
 * consider leaving the diffs in partial/
 * allow resuming downloading diffs, rather than always redownloading
them from scratch
 * add an option to only allow diffs up to a fixed total size for
embedded systems (since the diffs are loaded into memory as a whole
before being applied)
 * add/update test suite?
 * fix up any code style issues
 * fix up rred method so it can be used directly from the command line
to merge and apply ed-diffs
 * work out where ed diffs get generated these days, and provide patch
to generate Patch-Step-Size header, and optionally use rred method to
update patches to make it a bit easier to download stuff

If anyone wants to poke around, please do!

Cheers,
aj

[0] https://lists.debian.org/deity/2009/08/msg00169.html
[1] https://lists.debian.org/debian-devel/2012/02/msg00411.html

-- 
Anthony Towns <aj@erisian.com.au>


Reply to: