[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Hashsum mismatch prevention strategies



Joerg Jaspert <joerg@debian.org> writes:

> On 12843 March 1977, David Kalnischkies wrote:
>
>> Archive updates are happening faster and faster nowadays - debian is at
>> 6 hours, ubuntu at times at 1 hour
>
> Im pretty sure Debian won't go down below 6 hours any time soon and if
> we really do so, I wouldn't think hourly is a sane thing to do for an
> archive like ours. Anyways, different topic.
>
>>- users and scripts alike are presented
>> more and more often with messages from their package managers claiming
>> that the metadata doesn't match their expectations.
>> (aka: Release files refers to an older/newer Packages/Sources/â?¦ file)
>
> Currently the translation breakage, though the latest ftpsync I released
> yesterday should fix this a bit for the user experience.
>
>> To a certain point these problems can be avoided by deploying more and
>> more complex mirrorsync scripts with more or less atomic behavior.
>
> Yah, we need a better dists/ and not even more logic in ftpsync.
>
>> This doesn't help at all if we talk about round-robins thoughâ?¦
>
> Oh sure, just use staged pushing from ftpsync. :)

I think he ment that you get InRelease from one mirror (which has
already updated) and Packages from another (which hasn't).

>...
> - Saner diffs. Now that one is a "fun" one, I know, but having something
>   where you don't need to jump through dozens of very small files to end
>   up with the final result, but have one and out comes the result, for
>   example would be one thing. The sheer number of small .diffs makes it
>   unusable as soon as you have large bandwidth. It would be nice(r), i
>   think, if we could have something that lets you go from "x days ago to
>   $now".

Just doing a quick check using the existing 56 incremental pdiff files
from sid/main/binary-amd64/Packages.diff/ as basis:

    total incremental size (56 files): 3.3MB (1162KiB gzip)
    size of single patch (rredtool --merge *): 2.6MB (817Kib gzip, 652KiB xz)

A single one step patch is noticeably smaller do download. BUT

    total single patch size (56 patches): 27MB gzip

Because each single patch is bigger than the next the size just
explodes. The size required grows O(n^2) with the number of patches. And
don't forget that for every update all those patches need to be updated
and mirrored again. So that would be an extra 300MiB every 6 hours and
in the critical dist/* section where the out-of-sync problems occur.


I think straight single shot patches are out of the question given the
size increase that would entail. But the number of patches required
could be largely reduced by doing some merging of patches:

Given 16 pdiff files the patches could be aranged like this (-16 is
oldes, 0 is current Packages):

-16 ->   0
-15 -> -14 -> -12 -> -8 -> 0
-14 -> -12 -> -8 -> 0
-13 -> -12 -> -8 -> 0
-12 -> - 8 -> 0
-11 -> -10 -> 8 -> 0
-10 -> - 8 -> 0
- 9 -> - 8 -> 0
- 8 ->   0
- 7 -> - 6 -> -4 -> 0
- 6 -> - 4 -> 0
- 5 -> - 4 -> 0
- 4 ->   0
- 3 -> - 2 -> 0
- 2 ->   0
- 1 ->   0

In this scheme the patches are made so that a N day old Packages file
becomes a (N & N-1) old Packages file, the lowest set bit in N is
cleared. After at most log(N) patches you are up-to-date. With the
current limit of 56 pdiff files you never need more than 7 patches.

Because a lot of the patches only go a short distance the overall size
increase should be much lower.



But I'm not convinced the number of files to download is actualy the
limiting factor:

% wget -r -np http://ftp.de.debian.org/debian/dists/sid/main/binary-amd64/Packages.diff/
Total wall clock time: 5.2s
Downloaded: 66 files, 1.2M in 1.1s (1.11 MB/s)

Downloading and applying 56 pdiff files with apt takes a considerable
longer time with apt, in the realm of minutes, not seconds.

So I don't buy the argument that downloading many small files is the
problem. What needs to happen is that at a glance apt will know all the
pdiff files it needs to download, then it needs to pipeline the download
of all of them and last patch the Packages file in a single pass
applying all the patches in the right order at once.

With the current Index format and pdiff files one needs the patch
matching the SHA1 of the Packages file plus all following patches in
that order. So apt could pipeline them all. But that would break
repositories that do have single step patch files. Same with the above
method.

I would like to extend the format of the Index file there so that a
single entry can list multiple patches like this:

SHA1-Current: ced0270ddecf73e18ed604a632cd201d8006980e 29140592
SHA1-History:
 333d3ea2262f49befe07069df22390277587a4fb 29094219 2012-05-06-0214.07 2012-05-06-0813.32 2012-05-07-0813.32
 5bca4bd997d4466c9f7df3fcf77a21e02a152c12 29094451 2012-05-06-0813.32 2012-05-07-0813.32
 2fbf8424e650310c7cd9cf20070543ea036a39bf 29094377 2012-05-06-1414.33 2012-05-06-2013.13 2012-05-07-0813.32
 b08a3149b0be7201e98d74c699c0d5a33d493e84 29105545 2012-05-06-2013.13 2012-05-07-0813.32
 d6fb28816fffb9b74ae60c82d503cc79d531185c 29112085 2012-05-07-0213.45 2012-05-07-0813.32
 359bd6a4bf838c87d9d19b10e41a576784810237 29111799 2012-05-07-0813.32
 8b7482acdeda7514ad887720bda05ab9cf9d7abc 29111794 2012-05-07-1413.33 2012-05-07-2013.52 2012-05-08-0817.24
 6d9b7ccb8e2e6c066ac80e779d64d1f523042ae9 29112045 2012-05-07-2013.52 2012-05-08-0817.24
 12e2733c9a34fff1c86735b266ff315e84a78fc5 29112958 2012-05-08-0213.34 2012-05-08-0817.24
 ef7882176b5277e5be7b38e396c69c8f8167ac35 29117503 2012-05-08-0817.24
 380c5683e0942266c0570f77c989db2e565ca799 29120576 2012-05-08-1413.33 2012-05-08-2018.17
 cad793bca1f269d5f1ec68df1bc59931d7321098 29125630 2012-05-08-2018.17
 1799dbc63e7856d42cec3eb6b4192290f007bc9a 29127145 2012-05-09-0213.20

So instead of a single patch file a list of patch files is given. The
above example would be for the above log(N) method. With the current
totaly incremental files the lines would have 1-56 patches listed.

This change would allow apt to download all the patches it needs in one
go no matter if the patches are completly incremental, partially merged
or single step. And they could all be applied in one step too instead of
in 56 (7) seperate steps with checksum checks inbetween.

What is killing the pdiffs is the stop&go methodic the current
implementation has.

> - One rsync run ought to be enough to mirror all of Debian (or any
> derivate using similar structure). Not X, with various
> include/excludes.

Doing a complete rsync run with --delay-updates shouldn't be to bad.

>> Option A is that each mirror (if it chooses to do it) builds a big "index" of
>> hashsum-named hardlinks to the "old" location of the file. Given a
>> repository like this:
...
>> Option B would be to introduce "versioned" components.

Option C)
=========

The "Mirrors are going to be out-of-sync. Deal with it." option.

Instead of coming up with more and more complex repository layouts or
mirror scripts why not just accept that sometimes mirrors will be
out-of-sync and think of a way we can help the client to recover.

To recover the client needs to get hold of the correct files. So lets
add a location where the client can request them. In InRelease we add

Hash-Service: http://hash.debian.org/hash?h=#
 ftp://hash.debian.org/hash/#2/#
 rsync://hash.debian.org/hash/#2/#

Now when a client needs to download a Packages/Sources/PDiff/Translation
file with SHA1 863c99eb851479720c8930b31a9e85b6535ddab9 and the mirror
does not have the correct file the client can download one of

http://hash.debian.org/hash?h=863c99eb851479720c8930b31a9e85b6535ddab9
ftp://hash.debian.org/hash/86/863c99eb851479720c8930b31a9e85b6535ddab9
rsync://hash.debian.org/hash/86/863c99eb851479720c8930b31a9e85b6535ddab9

In Hash-Services a '#' means the full SHA1 checksum and #<X> the first X
characters of the SHA1 checksum.

This service would only be used when a mirror is out-of-sync and only
for the index files. So traffic should be much less than mirrors usualy
have.

To keep traffic even lower clients should download the
Packages/Sources/Translation files from the local mirror like they do
now. If they then notice a checksum mismatch they should request the
PDiff files from the Hash-Service whenever possible.


Pro:
- Minimal changes to the repository format (add one entry to InRelease)
- Backwards compatible [assuming clients ignore unknown entries in InRelease]
- No cooperation from mirror admins required
- No more (unrecoverable) out-of-sync errors
- a hash.debian.org round-robin pool can be synced prior to the general
  mirror push to ensure all of them are current

Contra:
- single point of failure (or small round-robin pool) for hash.debian.org
- hash.debian.org might get flooded if too many users have a out-of-sync
  mirror
- hash.debian.org needs to store files from dist/* for several mirror
  pushes. Minimum would be 2-3, maximum no more than a weeks worth. So
  there is some disk space required.

Comments?

MfG
        Goswin


Reply to: