
Bug#128818: [patch] packages.gz diff support for apt



Jeroen van Wolffelaar wrote:
I don't know that it's such a big deal -- your general use case is a daily "apt-get update", anyway, and that'll only become more so once those require downloading kBs instead of MBs. The other issue is that it makes server-side space requirements quadratic instead of linear (you've got N patches, the most recent of which is stored N times, the oldest of which is stored 1 time). If we've got enough space for N=10, then the choice is between storing 10 days of patches Jeroen-style, or 55 days of patches (11*10/2) ordinary style. The bandwidth hit might also be obnoxious, I'm not sure.
Regarding bandwidth, that only matters among mirrors carrying all the
Packages.gz files etc., and mirrors are assumed to have plenty of space
anyway (or not to mirror these files if they don't want to).

Err, complete mirrors aren't assumed to have infinite bandwidth, and they're not assumed to have arbitrary amounts of bandwidth we can waste. Note that if you've got the "10" days of patches, the single diff per day needs downloading two files (index and patch), and removing one (10 day old patch); the complete-patch-for-each-day needs to download 10 files (that are in total 55 times the size of the other patch we're downloading). 55*30kB is ~1.6MB. I'm still not convinced that counts as obnoxious, but it's not clearly unobnoxious either (in the way 30kB is).
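
The arithmetic behind the 55 and the ~1.6MB can be sketched as below. This is only an illustration of the counting argument in this thread: the function names are made up here, 30kB is the rough per-day patch size used above, and the index file is ignored.

```python
def patch_units_stored(n_days: int, cumulative: bool) -> int:
    """Server storage, measured in units of one day's diff.

    Daily style keeps n_days patches of one unit each. Cumulative style
    keeps n_days patches of 1, 2, ..., n_days units, i.e. n(n+1)/2.
    """
    if cumulative:
        return n_days * (n_days + 1) // 2
    return n_days


def mirror_daily_download_kb(n_days: int, cumulative: bool, patch_kb: int = 30) -> int:
    """What a mirror has to fetch per day, in kB (index file not counted).

    Daily style adds one new patch. Cumulative style regenerates every
    patch, so the whole set is transferred again.
    """
    units = patch_units_stored(n_days, True) if cumulative else 1
    return units * patch_kb


print(patch_units_stored(10, cumulative=True))        # 55
print(mirror_daily_download_kb(10, cumulative=True))  # 1650 (kB, about 1.6MB)
print(mirror_daily_download_kb(10, cumulative=False)) # 30
```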

Regarding space requirements, you're absolutely right. It can be a bit
less (~37% actually) when the diffs have duplicate/useless information
purged, rather than simply being concatenated.

That seems difficult to do without keeping all the old Packages files around, which would be nice to avoid?

37% less is around 33% less, which is around a third less; two thirds of 55 is around 37 times the size, so 30kB versus 1.1MB, which still isn't real convincing.

I've run some stats over the past 8 weeks of Packages.gz files for sid's
main i386. The full datasheet (badly formatted and a bit raw) is at
http://www.wolffelaar.nl/~jeroen/pdiff.sxc (OO.o calc)

Any chance of dumping to a .csv file?

The raw daily ed diffs average 50kB (ranging from 30kB to 150kB); the
bzip2-compressed versions average 12kB.

What's the gzipped size? It'd probably be nicer to go with that for things small, I think?

I'll now list some numbers for the 27 November Packages.gz files (all
dates are defined as the day on which those files are already available
at 0:00 UTC):

For
1) number of weeks to keep on server
I will list
2) total size needed for daily ed diffs (Anthony-style) (bz2)
3) total size needed for cumulative ed diffs (Jeroen-style) (bz2)
4) total size needed for optimized cumulative ed diffs (bz2)

Total server requirements will be about 25 times that (11 architectures
times two often-changing suites (testing & sid), plus 10% that I added
for sources and contrib/non-free). The figure of 25 is a bit of a guess,
though... I could run it for all.

25 sounds pretty fair as an estimate, though I'd expect Sources to change less than Packages (no descriptions or Depends: lines that get tweaked regularly, just Version: fields) rather than more; and not all architectures are going to be the same either, though I don't know how significant that is. How about running it on everything anyway? Three cheers for brute force and ignorance! My guess: factor of 19 or 20. Note it'll go up anyway when new architectures start getting added again.

(columns (1)-(3) are the three bz2 totals listed above: daily, cumulative,
and optimized cumulative ed diffs)

weeks      (1)      (2)      (3)  -- x25 -->      (1)      (2)      (3)
1         86kB    382kB    333kB                2.1MB    9.3MB    8.1MB
2        182kB   1315kB   1044kB                4.4MB   32.0MB   25.4MB
3        266kB   2948kB   2161kB                6.5MB   71.9MB   52.7MB
4        368kB   5221kB   3635kB                8.9MB  127.4MB   88.7MB
5        460kB   8093kB   5415kB               11.2MB  197.5MB  132.2MB
6        536kB  11591kB   7488kB               13.0MB  282.9MB  182.8MB
7        613kB  15668kB   9823kB               14.9MB  382.5MB  239.8MB
8        675kB  19589kB  12014kB               16.4MB  478.2MB  293.3MB

So, while the space requirements for this don't look too extreme, it
also shows that with less than 17MB mirror space you can keep two months
worth of ed diffs for all architectures and suites (do note that part of
the data is guessed, the number 25 as explained above).

Err, aren't you also guessing that the 1 week uses 86kB consistently? I find it hard to believe that it's /really/ that consistent.

Hrm, 8 weeks of index file isn't even such a big deal -- I use up about 120 bytes per entry, which is under 7kB for 8 weeks of daily entries.
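
As a quick check of that index estimate, here is a throwaway sketch. The function name is invented for illustration; the 120 bytes per entry is the figure quoted above.

```python
def index_size_bytes(weeks: int, entries_per_day: int, bytes_per_entry: int) -> int:
    """Rough size of an index with one fixed-size entry per patch."""
    return weeks * 7 * entries_per_day * bytes_per_entry


print(index_size_bytes(8, 1, 120))  # 6720 bytes, i.e. under 7kB
```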

Also worth investigating: how long does it take to apply (1), (2) and (3) after 6 to 8 weeks of changes have accumulated? I'd guess (3) should be okay, but I'd be a little worried about (1) and (2).
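
To give a feel for what "applying" means on the client, here is a minimal sketch of an ed-script applier. It is not apt's code and it makes assumptions: it only understands the a, c and d commands that "diff -e" normally emits, it expects the commands in the order "diff -e" writes them (bottom of the file first, so earlier line numbers stay valid), and it does not handle text lines that consist of a single ".".

```python
import re

# "N" or "N,M" followed by one of a (append), c (change), d (delete)
_CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply an ed script (list of lines, no newlines) to a list of lines."""
    out = list(lines)
    it = iter(script)
    for cmd in it:
        m = _CMD.match(cmd)
        if m is None:
            raise ValueError(f"unsupported ed command: {cmd!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        new = []
        if op in "ac":
            # The text block runs until a line containing only "."
            for text in it:
                if text == ".":
                    break
                new.append(text)
        if op == "a":
            out[start:start] = new        # insert after 1-based line `start`
        elif op == "c":
            out[start - 1:end] = new      # replace lines start..end
        else:
            out[start - 1:end] = []       # delete lines start..end
    return out


old = ["a", "b", "c", "d"]
script = ["4a", "e", ".", "2,3c", "X", ".", "1d"]
print(apply_ed_script(old, script))  # ['X', 'd', 'e']
```

Each command is a list splice, so the cost grows with the number of commands times the file length. Timing this on real Packages files would be one way to answer the question above.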

This is a nice idea: it combines having only one file to download with
the moderate space requirements. Implementation is a bit more tricky,
though, but I don't think it's prohibitively more difficult. An added
bonus is that it is just one file that gets prepended to, so the
directory listing next to the Packages.gz files doesn't contain an
enormous number of files. On the (small) downside, prepending and then
recompressing with bz2 makes this big patch file non-rsync-friendly to
transfer amongst mirrors.

Yeah -- but it's only 17MB a day in total; so big deal. And I suspect gzip --rsyncable wouldn't make it that much bigger either anyway. It's the client-side implementation issues that are really tricky.

Hrm. How about two files; an index and a single concatenated patch file, where the index tells you where to start and where to finish, and you just download those bytes, and apply them? Can apt methods reliably be made to support one of "download bytes 1..N of <url>" or "download bytes M..EOF of <url>"? I guess we can trust that ./Packages.diffdex.gz and ./Packages.diff.gz will all be in sync pretty much all the time on non-broken mirrors. :-/ One file, or an index and n files would be easier to make reliable.
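
For HTTP at least, "download bytes M..EOF" maps onto the standard Range request header. The sketch below only builds such a request with Python's urllib and says nothing about what apt's own methods support. The helper names and the mirror.example URL are made up for illustration. Note that HTTP byte offsets are 0-based and inclusive, unlike the "1..N" wording above.

```python
import urllib.request
from typing import Optional


def range_header(first: int, last: Optional[int] = None) -> str:
    """Value of an HTTP Range header for bytes first..last (open-ended if last is None)."""
    if first < 0 or (last is not None and last < first):
        raise ValueError("invalid byte range")
    return f"bytes={first}-" if last is None else f"bytes={first}-{last}"


def ranged_request(url: str, first: int, last: Optional[int] = None) -> urllib.request.Request:
    """Build (but do not send) a request for part of a file."""
    req = urllib.request.Request(url)
    req.add_header("Range", range_header(first, last))
    return req


# Hypothetical URL, only to show the shape of the request.
req = ranged_request("http://mirror.example/Packages.diff.gz", 500)
print(req.get_header("Range"))  # bytes=500-
```

A server that honours the header answers 206 Partial Content; a plain 200 means it ignored the range and is sending the whole file, so a client would have to check for that.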

Hrm. 25 weeks at 2 entries a day would be 350 entries, at 160 bytes per entry would be 56,000 bytes. Crap. Okay, 10 weeks at 1 entry a day is 70 entries, at 160 bytes per entry (2 lines), gives a little over 10kB of index information. *grump* Okay, my ideal one file format is therefore:

 Patch: 2004-11-28-035413
 Patch-MD5Sum: 12311 764efa883dda1e11db47671c4a3bbd9e
 File-MD5Sum: 8112111 c1e3db8ccea4541a0f3d7e5c75feb3fb
 Ed-Commands:
  1231112c
  Version: 1.00-2
  ...

 Patch: 2004-11-27-032836
 Patch-MD5Sum: ...
   ...

except that sucks too because you don't know if you're too far out of date until you get right to the end. Blehblehbleh.

Usage scenarios:

 (a) Download one patch to get from last update to today.
 (b) Download n patches to get from n updates ago to today
 (c) Download entire Packages file because patches aren't available or
     downloading patches would be slower/larger.

(a) should be as quick as possible, since it's the common case.
(c) shouldn't require downloading too much extra information to work out it's necessary -- 20-50kB is acceptable, 500kB isn't.

So (c) implies you need to be able to quickly get a list of all the "File-MD5Sums" we have patches for. So they have to be together in some index file, or an index section of some file.

Meh, I'm putting this down to premature optimisation and going back to an index file and n patch files. They can go in some subdirectory: Packages.gz, Packages.bz2, Packages.diffs/. Whatever.
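
With an index file and n patch files, the client-side decision between scenarios (a), (b) and (c) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the IndexEntry fields, the oldest-first ordering of the index, the reading of File-MD5Sum as the checksum of the file a patch applies to, and the rule "fall back to the full file when the patches are at least as large".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexEntry:
    from_md5: str    # md5 of the Packages file this patch applies to (assumed meaning)
    patch_name: str  # e.g. a timestamp such as "2004-11-28-035413"
    patch_size: int  # bytes to download for this patch


def plan_update(local_md5: str, current_md5: str,
                index: List[IndexEntry], full_size: int) -> Optional[List[str]]:
    """Return patch names to fetch (oldest first), [] if up to date,
    or None when the full Packages file should be downloaded instead."""
    if local_md5 == current_md5:
        return []
    for i, entry in enumerate(index):
        if entry.from_md5 == local_md5:
            needed = index[i:]
            if sum(e.patch_size for e in needed) >= full_size:
                return None  # scenario (c): patches are not worth it
            return [e.patch_name for e in needed]  # scenario (a) or (b)
    return None  # scenario (c): too far out of date, no patch chain


# Hypothetical index with two daily patches, oldest first.
index = [
    IndexEntry("md5-of-nov-26", "2004-11-27-032836", 12000),
    IndexEntry("md5-of-nov-27", "2004-11-28-035413", 12000),
]
print(plan_update("md5-of-nov-27", "md5-of-nov-28", index, 3_000_000))
print(plan_update("md5-of-oct-01", "md5-of-nov-28", index, 3_000_000))
```

The point of keeping all the checksums together in the index is visible here: case (c) is decided after reading only the index, before any patch is fetched.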

Cheers,
aj
