
Re: New method for Packages/Sources file updates

Thiemo Seufer <ica2_ts@csv.ica.uni-stuttgart.de> writes:

> Goswin von Brederlow wrote:
> [snip]
>> >> Following Steve's idea of marking removals in the Packages file itself
>> >> I've come up with the following:
>> >
>> > I think the whole approach is needlessly complicated. Could you
>> > read the idea covered in the thread starting at
>> > http://lists.debian.org/debian-devel/2004/07/msg00128.html
>> > and explain what's wrong with it in your opinion?
>> >
>> >
>> > Thiemo
>> + no performance difference for daily updates (+- a few bytes)
> + Full backward compatibility. Changing the Packages format is likely
>   to break some tools.

One could argue that I'm not changing the format at all: it is still
RFC 822 formatted text. The only other requirement I've seen so far is
apt's requirement that entries start with "Package:".
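
To make that constraint concrete, here is a minimal Python sketch of
reading such a file. It is not apt's actual parser; the sample data,
version numbers and function names are made up for illustration.

```python
import re

# Made-up sample data in the RFC 822 style used by Packages files.
SAMPLE = """\
Package: zsh-beta
Version: 4.2.1-1
Description: a shell
 with a continuation line

Package: hello
Version: 2.1.1-4
"""

def split_entries(text):
    """Split the text into entries separated by blank lines."""
    return [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]

def parse_entry(block):
    """Turn one entry into a dict of field name -> value."""
    fields = {}
    last = None
    for line in block.splitlines():
        if line[:1] in (" ", "\t"):
            # Continuation line: append to the previous field.
            if last is not None:
                fields[last] += "\n" + line.strip()
        else:
            key, sep, value = line.partition(":")
            if sep:
                last = key.strip()
                fields[last] = value.strip()
    return fields

def all_start_with_package(text):
    """Check the one constraint named above: entries begin with Package:."""
    return all(b.startswith("Package:") for b in split_entries(text))

entries = [parse_entry(b) for b in split_entries(SAMPLE)]
print([e["Package"] for e in entries], all_start_with_package(SAMPLE))
```

Any extra field inside an entry passes through a reader like this one
untouched, as long as the entry still begins with "Package:".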

>> - performance penalty for repeated patching of the same package
>>   (e.g. the zsh-beta upload every odd day)
>> - compression penalty due to lots of small files instead of one big
>>   one from gzip, even worse with bzip2
>> - performance penalty due to lots of small files instead of one big
>>   one from apt-method, forking gunzip, forking patch
> Client-side performance is mostly irrelevant. Also, this particular
> set of problems can be solved by using cumulative diffs instead of
> several incremental ones.

The number of ftp connections needed is highly relevant. For http the
penalty isn't that big but still adds up.

With cumulative patches you run into the problem that you need a new
cumulative patch for every day, and each one contains most of what the
previous one did. That very quickly becomes a space issue.
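
A rough Python sketch of that space argument follows. The 20 KB per
day figure is an assumption picked for illustration, and the
cumulative number is an upper bound: it assumes changes from different
days never touch the same package, which repeated uploads would
violate.

```python
def incremental_storage(daily_sizes):
    """One diff per day: total space is the sum of the daily diffs."""
    return sum(daily_sizes)

def cumulative_storage(daily_sizes):
    """One cumulative diff per starting day, each covering that day
    through the latest day. Upper bound, assuming no overlap between days."""
    total = 0
    for start in range(len(daily_sizes)):
        total += sum(daily_sizes[start:])
    return total

# Hypothetical: 30 days of updates at 20 KB per day.
days = [20] * 30
print(incremental_storage(days))  # 600 KB
print(cumulative_storage(days))   # 9300 KB
```

The incremental set grows linearly with the number of days covered,
while the cumulative set grows roughly with its square.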

>> - multi pass method where a failure in any one of them is fatal
> Failure isn't fatal, it just triggers fallback to the full
> Packages file. Btw, this can also be solved by cumulative diffs.

It is still more likely to fail, imho, and failure is annoying.

>> - timestamp on package is timezone/clock dependent but the index
>>   should protect that
> You haven't read the complete thread. The timestamp problem can be
> avoided.
>> - extra space needed for the diff files
> Which is minimal in comparison to the archive size.

Not for something like snapshots.debian.net. They do have a tad more
Packages files than Debian has. And why waste even a byte if it is
absolutely not needed to achieve the same result?

>> - limited update interval to stop the extra space from exploding,
>>   2 weeks suggested
> Rather a heuristic based on patch sizes << Packages size and the
> number of update cycles. The absolute timespan isn't a good measure,
> just think about the typical update cycles for unstable, stable and
> security/stable.

Think about unstable main. That is where most of the updating (user
and archive) happens and most of the benefit will come from.

The new packages for October amount to 691 KB gzipped. That is still
less than 20% of the full file, so providing update intervals of over
a month for unstable is still worth it. That is over 30 diff files in
your case, and multiple updates of the same package will then
accumulate in the diffs.
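
The arithmetic behind that claim, as a small Python sketch. Only the
691 KB and 20% figures come from the numbers above; the function names
are made up, and the full-file size is derived from those two figures
rather than measured.

```python
def min_full_size_kb(update_kb, max_percent):
    """Smallest full-file size (KB) for which update_kb is at most
    max_percent of it."""
    return update_kb * 100 / max_percent

def saved_fraction(update_kb, full_kb):
    """Fraction of the download avoided by fetching only the update."""
    return 1 - update_kb / full_kb

OCTOBER_UPDATE_KB = 691  # figure quoted above

# "Less than 20%" implies the full gzipped file is larger than this.
full_kb = min_full_size_kb(OCTOBER_UPDATE_KB, 20)
print(full_kb, saved_fraction(OCTOBER_UPDATE_KB, full_kb))
```

So a user who is a full month behind would still avoid at least
roughly four fifths of the download.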

For stable and especially security the amount of change will be even
less and even more diff files would still be worth it. The size would
be smaller but the number of files higher.

>>   while my method can cover full releases for a few
>>   K extra
> True. OTOH, covering many update cycles isn't that useful for typical
> use.
>> - new files that mirrors won't pick up for a long time,
>>   can only be used on mirrors that are reconfigured to mirror diffs too
> I think full mirrors already cover the whole directory contents, so this
> is only a problem for partial mirrors. It's also non-fatal, as it falls
> back to the current method (failing index file accesses are the only
> difference).
> Reduced server load provides an incentive to those mirror admins to
> change their scripts.

ftp.de.debian.org doesn't seem to pick up new meta files from the
amd64 archive. I tried adding some there to test both methods (sorted
and diffs) against each other. This is no proof it won't do so for
debian but the mirror script / options are likely to be identical.

>> - no benefit for rsync or zsync
> True. OTOH, low bandwidth users are unlikely to update their machines
> via rsync.
>> - not applicable (due to number of files) to archives with hourly
>>   updates (like amd64, and we might even do 15m updates to prevent
>>   Build-Depends stalls)
> This suggests interested parties do frequent updates anyway. This
> eventually allows to shorten the timespan covered, which means the
> number of files won't increase much.

Not really. The buildd will do an update before each package build
(usually just getting a HIT). That does not mean that users will
update any more frequently than they do now.

>> - probably unusable on snapshots.debian.net like archives with tons of
>>   Packages files due to too many tiny files
> Which is a good thing, since archived Packages files aren't supposed
> to get updated. :-)

I meant those files:

| apt-get specific package(s)
|  deb http://snapshot.debian.net/archive pool package ...
|  deb-src http://snapshot.debian.net/archive pool package ...
| where package is source package name as debian/pool directories.

They have quite a lot of those. :)

>> Need any more? :)
> Yes.
> Thiemo

Do diffs have any benefits apart from being simpler to apply? Maybe
that would convince me. Creating them certainly isn't simpler
(algorithmically speaking).

