
Re: New method for Packages/Sources file updates



Thiemo Seufer <ica2_ts@csv.ica.uni-stuttgart.de> writes:

> Goswin von Brederlow wrote:
> [snip]
>> >> - performance penalty for repeated patching of the same package
>> >>   (e.g. the zsh-beta upload every odd day)
>> >>
>> >> - compression penalty due to lots of small files instead of one big
>> >>   one from gzip, even worse with bzip2
>> >>
>> >> - performance penalty due to lots of small files instead of one big
>> >>   one from apt-method, forking gunzip, forking patch
>> >
>> > Client-side performance is mostly irrelevant. Also, this particular
>> > set of problems can be solved by using cumulative diffs instead of
>> > several incremental ones.
>> 
>> The number of ftp connections needed is highly relevant. For http the
>> penalty isn't that big but still adds up.
>> 
>> With cumulative patches you run into the problem that you need a new
>> cumulative patch for every day that contains most of what the
>> previous one did. That really quickly becomes a space issue.
>
> Errm, no, it doesn't need _one_ new cumulative patch. All the
> previously made cumulative diffs need to be updated.

I was thinking of a

-1day.diff
-2day.diff
-3day.diff
...

So every day a new file appears at the end and contains most of what
all the others already contain.

Updating those cumulative diffs is also either inefficient (cat the
daily diffs together), very hard (figure out how to make a minimal diff
from the dailies) or you need every day's Packages file (apt-dupdate
does that).

Having to store and diff every past day's Packages file is a huge
resource drain and can't be done for more than a couple of days, maybe
up to 2 weeks.

Ask the apt-dupdate author how long it takes every night and how
much disk space it uses.
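
To make the "keep every day's Packages file" variant concrete, here is a
minimal Python sketch. It is only an illustration under assumptions, not
how apt-dupdate actually works: the function name is made up, and the
snapshots are passed in as already uncompressed strings, oldest first.

```python
import difflib


def cumulative_diffs(snapshots):
    """Build one cumulative diff per stored day (hypothetical helper).

    snapshots: list of Packages file contents as strings, oldest first;
    the last entry is today's file. Returns a dict mapping names like
    "-1day.diff" to unified diff text from that day's file to today's.
    """
    current = snapshots[-1].splitlines(keepends=True)
    diffs = {}
    # Walk backwards in time: age 1 is yesterday, age 2 the day before, ...
    for age, old in enumerate(reversed(snapshots[:-1]), start=1):
        old_lines = old.splitlines(keepends=True)
        diff = difflib.unified_diff(
            old_lines,
            current,
            fromfile=f"Packages.-{age}day",
            tofile="Packages",
        )
        diffs[f"-{age}day.diff"] = "".join(diff)
    return diffs


if __name__ == "__main__":
    days = [
        "Package: a\nVersion: 1\n",
        "Package: a\nVersion: 2\n",
        "Package: a\nVersion: 3\n",
    ]
    for name, text in cumulative_diffs(days).items():
        print(name)
        print(text)
```

Note that every stored snapshot has to be read and diffed against today's
file on each run, which is exactly the nightly cost in question.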

> If we assume to hold 14 update cycles, have a cutoff if the size of
> the cumulative diff exceeds the size of the Packages file, and have
> linear growth of the diffs, then the additional space used is at most
> seven times the size of the Packages file. Normally it will be much
> less, because large archives don't tend to change that quickly.

14 update cycles is a limitation on the process and isn't needed with
sorted Packages files.

Also how do you get 'seven times'? Say every day one package changes
but on the last day nearly every package changes. That means all 14
cumulative diffs will be nearly the size of the Packages file (change
as many packages as possible but so that all stay below the cutoff).

That would be nearly 14 times the space.
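
As a rough check of the two estimates, here is a small Python sketch. It
assumes a simplified model that is not spelled out in this thread: the
cumulative diff covering the last k days is as large as the sum of those
k daily changes, and a diff is dropped once it exceeds the size of the
Packages file. Sizes are arbitrary integer units and the function name
is made up for illustration.

```python
def total_cumulative_space(daily_changes, packages_size):
    """Total size of all cumulative diffs, in the units of the inputs.

    daily_changes[0] is the most recent day's change, daily_changes[1]
    the day before that, and so on (all values non-negative).
    """
    total = 0
    covered = 0
    for change in daily_changes:
        covered += change            # diff reaching one more day back
        if covered > packages_size:  # cutoff: this and all older diffs are dropped
            break
        total += covered
    return total


PACKAGES = 1400  # size of the Packages file, arbitrary units

# Linear growth: the same amount changes every day for 14 days.
linear = total_cumulative_space([100] * 14, PACKAGES)

# One package per day, but nearly everything changed on the most recent day.
late_burst = total_cumulative_space([1260] + [7] * 13, PACKAGES)

print(linear / PACKAGES)      # 7.5
print(late_burst / PACKAGES)  # about 13.06
```

Under this model linear growth gives roughly the 'seven times' figure,
while a single large change on the most recent day pushes the total
towards 14 times.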

> [snip]
>> >> - extra space needed for the diff files
>> >
>> > Which is minimal in comparison to the archive size.
>> 
>> Not for something like snapshots.debian.net. They do have a tad more
>> Packages files than debian has. And why waste even a byte if it is
>> absolutely not needed to achieve the same?
>
> Again, snapshots shouldn't have any need for updating a snapshot.

Yes they do. Every time a new version of a package is released the
Packages file updates. And it never gets smaller. Those would be
perfect for the date sorted method.

>> > Rather a heuristics based on patch sizes << Packages size and the
>> > number of update cycles. The absolute timespan isn't a good measure,
>> > just think about the typical update cycles for unstable, stable and
>> > security/stable.
>> 
>> Think about unstable main. That is where most of the updating (user
>> and archive) happens and most of the benefit will come from.
>> 
>> The amount of new packages for October is 691Kb as gzip. That is still
>> less than 20% of the full file. Providing update intervals of over a
>> month for unstable is still worth it. That is over 30 diff files in
>> your case and then multiple updates of the same packages will
>> accumulate in the diffs.
>
> No, they won't if cumulative diffs are used.

Tell me how you plan to create the 30 cumulative diffs each
day. Storing the Packages files as plain text wastes too much
space. bunzip2ing them every night takes too long. Just diffing them
is also not that fast.

Or for 60 days, which would still be <50% the size.

>> For stable and especially security the amount of change will be even
>> less and even more diff files would still be worth it. The size would
>> be smaller but the number of files higher.
>
> I can't follow you. stable would have three additional diffs by now.

stable-proposed-updates

What I mean is that each change is very small. So the diff files don't
grow much and even a large number of diffs is still below the size of
the Packages file.

It is not like sid where you have 100+ package changes every day.

> For stable-security I assume it's either tracked closely or very
> infrequently. Providing a slightly faster update in the latter case
> doesn't seem to be worthwhile.

The date sorted method gets it for free.

>> >> - not applicable (due to number of files) to archives with hourly
>> >>   updates (like amd64, and we might even do 15m updates to prevent
>> >>   Build-Depends stalls)
>> >
>> > This suggests interested parties do frequent updates anyway. This
>> > eventually allows to shorten the timespan covered, which means the
>> > number of files won't increase much.
>> 
>> Not really. The buildd will do an update before each package build
>> (usually just getting a HIT). That does not mean that users will do any
>> more frequent updates than now.
>
> Then a "newly built" archive should probably be used for the buildd,
> sparing users/mirrors from the inconvenience of an archive which is
> almost always update_in_progress.
>
> I think the official buildds use incoming.d.o for that.

Yes.

>> >> - probably unusable on snapshots.debian.net like archives with tons of
>> >>   Packages files due to too many tiny files
>> >
>> > Which is a good thing, since archived Packages files aren't supposed
>> > to get updated. :-)
>> 
>> I meant those files:
>> 
>> | apt-get specific package(s)
>> | 
>> |  deb http://snapshot.debian.net/archive pool package ...
>> |  deb-src http://snapshot.debian.net/archive pool package ...
>> | 
>> | where package is source package name as debian/pool directories.
>> 
>> They have quite a lot of those. :)
>
> Any reason why those tiny files should ever get any download
> optimization?

Why not?

>> >> Need any more? :)
>> >
>> > Yes.
>> >
>> >
>> > Thiemo
>> 
>> Do you see any benefits of diffs apart from them being simpler to
>> apply?
>
> They keep a backward compatible Packages file which is proven to
> work with old tools. Furthermore, updating on the server side
> can be done by a simple script which invokes diff a few times.
>
> The latter is especially interesting for partial mirror scripts
> which usually fail to implement a decent parser for Packages files.

How would a diff be better for a mirror script that doesn't parse
Packages files? You still need a Packages file parser. You lost me
there.

> Thiemo

MfG
        Goswin


