Bug#128818: [patch] packages.gz diff support for apt



On Thu, Nov 25, 2004 at 02:53:04PM +1000, Anthony Towns wrote:
> Michael Vogt wrote:
> >I wonder if the idea of Jeroen van Wolffelaar to
> >use only one ed-style diff is workable. It would indeed have a much
> >better performance for the client. 
> 
> I don't know that it's such a big deal -- your general use case is a 
> daily "apt-get update", anyway, and that'll only become moreso once 
> those require downloading kBs instead of MBs. The other issue is that it 
> makes server-side space requirements be squared instead of linear 
> (you've got N patches, the most recent of which is stored N times, the 
> oldest of which is stored 1 time). If we've got enough space for N=10, 
> then the choice is between storing 10 days of patches Jeroen-style, or 
> 55 days of patches (11*10/2) ordinary style. The bandwidth hit might 
> also be obnoxious, I'm not sure.

Regarding bandwidth: the extra traffic is only among mirrors that carry all
the Packages.gz files etc., and mirrors are assumed to have plenty of space
anyway (or can choose not to mirror these files if they don't want to).

Regarding space requirements, you're absolutely right. It can be a bit
less (~37% actually) when the diffs have duplicate/useless information
purged, rather than simply being concatenated.
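
To make the "squared instead of linear" point concrete, here is a rough
sketch in Python. It assumes every daily patch has the same size and that
cumulative diffs are plain concatenations (no purging), which is a
simplification; the function names are made up for illustration.

```python
def daily_storage(n_days, patch_size=1):
    """Daily style: one patch per day, each stored exactly once."""
    return n_days * patch_size


def cumulative_storage(n_days, patch_size=1):
    """Cumulative style: the diff from k days ago to today holds k days
    of changes, so the sizes add up to 1 + 2 + ... + n_days."""
    return sum(k * patch_size for k in range(1, n_days + 1))


print(daily_storage(10))       # 10 patch units
print(cumulative_storage(10))  # 55 patch units, i.e. 11*10/2
```

Under these assumptions, the space that holds 10 days of cumulative diffs
would hold 55 days of daily diffs, which is the trade-off quoted above.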

> I'd be interested in seeing how that actually ends up looking for 
> unstable and testing, though.

I've run some stats over the past 8 weeks of Packages.gz files for sid's
main i386. The full datasheet (badly formatted and a bit raw) is at
http://www.wolffelaar.nl/~jeroen/pdiff.sxc (OO.o calc)

The raw daily ed diffs are on average 50kB (ranging from 30kB to 150kB),
the bzip2-compressed versions on average 12kB. I'll now list some numbers
for the 27 November Packages.gz files (all dates are defined as the day on
which those files are already available at 0:00 UTC):

For each number of weeks to keep on the server, I will list:
(1) total size needed for daily ed diffs (Anthony-style) (bz2)
(2) total size needed for cumulative ed diffs (Jeroen-style) (bz2)
(3) total size needed for optimized cumulative ed diffs (bz2)

Total server requirements will be about 25 times that (11 architectures
times two often-changing suites (testing & sid), plus 10% added for
sources and contrib/non-free). The figure of 25 is a bit of a guess,
though... I could run the numbers for all of them.

weeks   (1)   (2)      (3)  -- x25 --> (1)     (2)     (3)
1       86kB   382kB   333kB          2.1MB   9.3MB   8.1MB
2      182kB  1315kB  1044kB          4.4MB  32.0MB  25.4MB
3      266kB  2948kB  2161kB          6.5MB  71.9MB  52.7MB
4      368kB  5221kB  3635kB          8.9MB 127.4MB  88.7MB
5      460kB  8093kB  5415kB         11.2MB 197.5MB 132.2MB
6      536kB 11591kB  7488kB         13.0MB 282.9MB 182.8MB
7      613kB 15668kB  9823kB         14.9MB 382.5MB 239.8MB
8      675kB 19589kB 12014kB         16.4MB 478.2MB 293.3MB
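
The x25 columns can be re-derived from the per-architecture figures with a
few lines of Python. This is only a sketch: it assumes 1MB = 1024kB, and the
names SIZES_KB, FACTOR and scaled_mb are made up here. Only three rows of the
table are copied in.

```python
# weeks -> (daily, cumulative, optimized cumulative), in kB, from the table
SIZES_KB = {
    1: (86, 382, 333),
    4: (368, 5221, 3635),
    8: (675, 19589, 12014),
}

# 11 architectures * 2 suites * 1.1 = 24.2; the post uses a guessed 25
FACTOR = 25


def scaled_mb(size_kb, factor=FACTOR):
    """Scale a per-architecture size in kB to an archive-wide size in MB."""
    return size_kb * factor / 1024


for weeks, row in SIZES_KB.items():
    print(weeks, " ".join(f"{scaled_mb(kb):7.1f}MB" for kb in row))
```

The results land within about 0.1MB of the x25 columns; the small
differences presumably come from rounding in the kB figures.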

So, while the space requirements for cumulative diffs don't look too
extreme, the numbers also show that with less than 17MB of mirror space you
can keep two months' worth of daily ed diffs for all architectures and
suites (do note that part of the data is guessed, namely the factor of 25
as explained above).

If you're only going to support a week or so, however, it doesn't matter
much. As Anthony Towns' suggestion scales much better, I suggest going
for the index file. Daily updates will be the most common case, and with
this index file and HTTP connection reuse, you can quite efficiently
download all the patches you need.
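
As a sketch of how a client might use such an index file (Python; the index
layout is an assumption made up for illustration, not a settled format):
assume the index lists, oldest first, the hash of the Packages file each
daily patch applies to together with the patch file name, and that the hash
of the newest Packages file is known separately.

```python
def patches_needed(index, current_hash, latest_hash):
    """Return the patch names to download, oldest first.

    index        -- list of (hash_of_file_before_patch, patch_name),
                    ordered oldest to newest (hypothetical layout)
    current_hash -- hash of the Packages file the client has now
    latest_hash  -- hash of the newest Packages file on the server

    Returns [] if the client is up to date, or None if its file is too
    old (or unknown) and the full Packages.gz has to be fetched.
    """
    if current_hash == latest_hash:
        return []
    for pos, (from_hash, _name) in enumerate(index):
        if from_hash == current_hash:
            return [name for _hash, name in index[pos:]]
    return None


# Made-up hashes and file names, purely for illustration.
index = [
    ("aaa111", "2004-11-24.ed.bz2"),
    ("bbb222", "2004-11-25.ed.bz2"),
    ("ccc333", "2004-11-26.ed.bz2"),
]
print(patches_needed(index, "bbb222", "ddd444"))
```

A client that updates daily would get a single patch name back; one that is
a few days behind fetches a handful of small files over one reused
connection.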
 
> I'm half tempted to suggest thinking about an annotated patch file, that 
> looks like:
> 
> 	patch-for abcdef12341231def1123 4123 2004-11-23-131421.1234
> 	* a 31
> 	* blahblah
> 	* .
> 	patch-for a4234534562bce123423f ...
> 	* ...
> 
> that concatenates all the information for the patches in a single file, 
> most recent to least recent with some index stuff at the top, and you 
> just stop downloading once you've got enough information, or you find 
> out it's not going to work. Might be overly complicated though.

This is a nice idea: it combines having only one file to download with
moderate space requirements. The implementation is indeed a bit more
tricky, but I don't think it's prohibitively more difficult. An added
bonus is that there is just one file, which gets prepended to, so the
directory listing next to the Packages.gz files doesn't contain an
enormous number of files. On the (small) downside, prepending and then
recompressing with bz2 makes this big patch file rsync-unfriendly to
transfer amongst mirrors.
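
A minimal sketch of the client side of such an annotated patch file, in
Python. It assumes the layout from the quoted example: a "patch-for <hash>
..." header per section, body lines prefixed with "* ", newest section
first. Only the hash field is used; the other header fields are ignored,
and the function name and sample data are made up.

```python
import io


def collect_patches(stream, current_hash):
    """Read sections newest-first and stop once the section for
    current_hash is complete. Returns [(hash, body_lines), ...] in the
    order to apply them (oldest first), or None if no section matches."""
    collected = []  # newest first, as in the file
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("patch-for "):
            if collected and collected[-1][0] == current_hash:
                break  # we already have everything we need; stop reading
            collected.append((line.split()[1], []))
        elif line.startswith("* ") and collected:
            collected[-1][1].append(line[2:])
    if not collected or collected[-1][0] != current_hash:
        return None
    return list(reversed(collected))


# Made-up sample; the bodies are placeholders, not real ed scripts.
sample = """\
patch-for ccc333 10 2004-11-26
* a 3
* newest change
* .
patch-for bbb222 12 2004-11-25
* a 2
* middle change
* .
patch-for aaa111 9 2004-11-24
* a 1
* oldest change
* .
"""
result = collect_patches(io.StringIO(sample), "bbb222")
print([from_hash for from_hash, _body in result])
```

With a real download the stream would be the decompressed HTTP response, and
stopping the read early is what saves the bandwidth.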

--Jeroen

-- 
Jeroen van Wolffelaar
jeroen@wolffelaar.nl
http://jeroen.A-Eskwadraat.nl


