[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: New method for Packages/Sources file updates



Steve McIntyre <steve@einval.com> writes:

> Goswin writes:
>>
>>The date sorted Packages file and the removal file on the other hand
>>are very easy to generate (possibly by katie itself) and only need a
>>few extra K for the removals. It is also about twice as efficient
>>since removals of packages are mostly implicit (and complete removals
>>only use the package name).

After some talk on irc I have to correct this a bit. The diff files
would be ed scripts and not unified diffs, they would not contain the
old entries as - lines. The size for the diff file and my method
should be near identical. So download wise both methods don't differ
that much unless the same packages get changed over and over and
reapear in every diff. My method would only download the last
change. For weekly/monthly updates this might be a noticeable
difference. [The diff files could be optimized for this but the only
trivial way for that I know needs all the old Packages files.]

> I covered exactly this suggestion in my blog a while back[1,2] and
> spoke to James and Colin and others about it then. There are issues
> with it, though. Th existing infrastructure doesn't contain the needed
> information in a useful format (e.g. the date a package first entered
> testing). After some discussion I was convinced that the diff method
> _is_ better; it just needs implementing. I've been busy and haven't
> looked any further since, much to my shame...

You don't need the actual time. What I do is:

Packages file:
- read in the old packages file and the new packages file.
- output any entries only in the new Packages file.
- output entries in both Packages files.

Removal file:
- output any package name in the old Packages but not in the new one
- append the old removal file

If it is too hard for the database to remember the time (and for
stable/testing that sounds very plausible) this post-processing can be
trivialy added.

All of this is done with two merge-sort O(n log n) and linear
comparisons of the lists O(n). Takes around a second on my amd64 for
main.

> [1] http://blog.einval.com/2004/08/08#packages-files

| One issue that this does not cope with is removed packages and
| sources. There is an easy way to do that too: add a new small stanza
| for a binary/source package with a new Removed-time: field. When the
| client sees this stanza, it will know to remove older information
| about that package/source from its merged output.

I like that idea. The stanza could probably be made in a way that
prevents old clients (unaware of the Removed-time meaning) for seeing
this as a real package.

Combine this with my idea to download until you see a stanza you
already have (instead of the timestamps) and the post processing of
Packages file (to sort by date) and you avoid the problem of having to
know when a Package entered testing and having the extra removals file.

Drawback of this would be that removal stanzas could only be removed
if they are the last stanza in the Packages file without triggering a
(detectable) algorithm failure and the need to download the rest of
the Packages file. .oO(But people updating more than say 6 month apart
can download the full file. So this is no reason not to do it.)

I think I will incorporate this into my genpackage binary and run it
over the last months Packages file.

> [2] http://blog.einval.com/2004/08/19#packages-files2

| But Daniel mentioned earlier that simply MD5ing the client's
| existing Packages file and having it ask for diff.<MD5> from the
| server should do the job. Fine, I see how that works.

That would require rebuilding the diffs for <MD5> every day or for the
server to concatenate the needed daily diffs from that md5 till today
on the fly. Rebuilding would mean mirrors redownload the same data
over and over every day while concatenating needs server changes on
the mirrors.

An index of md5sum and patch to the next day would work and the client
would then work through that index till it reaches todays md5sum.


Personaly I too prefer the sorted Packages file. None of the diff
faction seem to think past a forthnight between updates. The diff way
grows by O(<changes> * <max. days supported>) while sorted Packages
files only grow by O(<complete removals>) of which there are very few.

>
> -- 
> Steve McIntyre, Cambridge, UK.                                steve@einval.com
> Who needs computer imagery when you've got Brian Blessed?

MfG
        Goswin

OT: [2] mentions 'Final sarge edition of debian-cd uploaded'. Did that
include amd64 support?



Reply to: