Re: Hashsum mismatch prevention strategies
- To: Raphael Geissert <firstname.lastname@example.org>
- Cc: email@example.com, firstname.lastname@example.org
- Subject: Re: Hashsum mismatch prevention strategies
- From: Goswin von Brederlow <email@example.com>
- Date: Fri, 01 Jun 2012 10:37:05 +0200
- Message-id: <firstname.lastname@example.org>
- In-reply-to: <email@example.com> (Raphael Geissert's message of "Mon, 21 May 2012 11:33:04 -0500")
- References: <CAAZ6_fBOSg02os-MkAbn-qojggT2Bw7fC8t-zkMYBPB21JSOsg@mail.gmail.com> <20120520183006.GA30706@debian.org> <20120521090549.GA4857@debian.org> <firstname.lastname@example.org>
Raphael Geissert <email@example.com> writes:
> On Monday 21 May 2012 02:12:06 Julian Andres Klode wrote:
>> On Sun, May 20, 2012 at 06:30:06PM +0000, Raphael Geissert wrote:
>> > Goswin von Brederlow wrote:
>> > I'm not even sure a new field needs to be introduced. It's just a
>> > matter of stating that the fields are ordered and if you have hash X
>> > you need all the patches mentioned in that line and the ones that
>> > follow.
>> Assuming that would break reprepro repositories.
> Alright, so a new field *is* needed.
>> > Additionally, I'd like an alternative way to distribute the pdiffs to
>> > be considered: after n days (say 2), gunzip the patches worth of one
>> > day and tgzip them. This not only reduces the number of requests
>> > needed to download all the files, but it also provides better
>> > compression. Adding an Index file to the tarball would be enough for
>> > apt to know which ones to apply and which ones to ignore.
>> A tar in between would complicate the code on the client, and break
>> backwards compatibility.
> Until recently the archive would not provide enough diffs to allow Packages
> files older than a couple of days to be reused. Returning to that behaviour
> for clients that don't support tar-ed diffs wouldn't affect much. I even dare
> to speculate that clients who have an unmetered connection will welcome
> that, as it currently takes considerably more time to download 10 or so
> pdiffs than to download the whole Packages file.
> The whole indexdiff logic could be moved away from apt-pkg to a method that
> does the right thing. The main bit that is missing is the ability for a
> method to issue sub-requests without instantiating the method directly
> (which is what the mirror method does.)
> How does that sound?
Not ideal and also just not needed.
The pdiff files can easily be merged so there is no need to involve tar
there. Also, since we need a new field (as you say above) it makes sense
to choose the new field in such a way that we can make everyone happy.
My earlier suggestion was to modify the Index format so that each line
states all the pdiff files needed to go from the given checksum to the
current file.
A better suggestion apparently had been made long before and forgotten
again: add a new field that, for each patch, states the checksum of the
resulting patched file. The client can then look up the checksum of its
Packages file to find the patch, look up the patch to find the expected
resulting checksum, and repeat until it reaches the checksum of the
current file, thereby building up a list of pdiff files it needs to
download.
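As a rough illustration of that lookup loop, here is a minimal sketch in
Python. The dictionary layout, the function name, the patch names and
the shortened checksums are all made up for the example; this is not
apt's actual code or the real Index format. Each entry maps the checksum
of a Packages file to the patch that applies to it and the checksum that
patch produces, so incremental and merged patches can sit side by side.

```python
def build_patch_chain(index, have, want):
    """Return the list of patch names leading from checksum `have` to
    checksum `want`, or None if no chain exists (the caller would then
    fall back to downloading the full Packages file)."""
    chain = []
    seen = set()
    current = have
    while current != want:
        # Unknown checksum or a cycle in the index: give up.
        if current in seen or current not in index:
            return None
        seen.add(current)
        patch, result = index[current]
        chain.append(patch)
        current = result
    return chain


# Hypothetical index: checksum -> (patch name, checksum after patching).
# "aaa" is covered by a merged patch that skips straight to "ccc".
index = {
    "aaa": ("merged-05-29_05-30", "ccc"),
    "bbb": ("2012-05-30", "ccc"),
    "ccc": ("2012-05-31", "ddd"),
}

print(build_patch_chain(index, "aaa", "ddd"))
print(build_patch_chain(index, "bbb", "ddd"))
```

A client holding "aaa" needs two patches here, one merged and one
incremental, while a client whose checksum is not listed at all gets
None and has to fetch the whole file.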
Now the beauty of this is that it remains compatible with existing
clients (which should all be ignoring the new unknown field), supports
DAK's incremental diffs, reprepro's single-step diffs and anything
in between. That includes having a mix of incremental and merged diffs:
0 < 1 2 < 3 4 < 5 6 < 7 8 < 9 10 < 11 12 < 13 14 < 15
\___/ / \___/ / \___/ / \_____/
\_______/ / \_________/
A mix like the above would already cut the number of PDiff files from n
to log(n) and the new index field would allow pipelining the patches.
Although just pipelining the patches and applying them in a single go is
probably nearly as fast.
Whether to use incremental, single-step or a mix is a question of
size. With Debian's current PDiff depth, single-step diffs would increase
the size by a factor of 27, so I think a mix is a good compromise. At least
till PDiff files can be pipelined and purely incremental diffs aren't
dead slow anymore.
>> > What I had proposed on irc was a combination of a) and b), sort of:
>> > Let's call it option D:
>> > * Only the InRelease files have a constant name
>> > * All the other files have a date or some other sort of serial number
>> > appended, e.g. Packages-12042014
>> > * Compatibility symlinks are kept in place, but it is known they will
>> > be prone to race conditions (404s, even).
>> > * APT and others find the names of the latest available indexes from
>> > the InRelease file
>> > * InRelease becomes the one and only place at which a mirror "switches"
>> > from one push to another.
>> Which is basically the same I proposed as well [and what is option
>> C from the Ubuntu discussion].
And still fails if the InRelease is updated before the Packages-12042014
> Right, I had not read your email by then. Using a hash is probably overkill
> and if there's even the possibility of gaining something by using --fuzzy,
> it would be killed by using such a naming format.
> According to rsync's code, it uses a modified Levenshtein edit distance, with
> a limit of 25 edits. Normally, one could expect the edit distance of two
> hashes to be near the length of the hash.