
Re: Hashsum mismatch prevention strategies

On Sat, May 19, 2012 at 11:27 PM, Joerg Jaspert <joerg@debian.org> wrote:
> On 12843 March 1977, David Kalnischkies wrote:

^^ That is a lovely timestamp. (SCNR)

>> Archive updates are happening faster and faster nowadays - debian is at
>> 6 hours, ubuntu at times at 1 hour
> Im pretty sure Debian won't go down below 6 hours any time soon and if
> we really do so, I wouldn't think hourly is a sane thing to do for an
> archive like ours. Anyways, different topic.

Sure, but think of all the kitties^Wderivatives: Some might not be as
strongly organized as Debian. Many probably follow a more relaxed "archive
updates when needed" approach, which might mean a few days or a few minutes
between them at times. But yeah, that is a different topic…

>>- users and scripts alike are presented
>> more and more often with messages from their package managers claiming
>> that the metadata doesn't match their expectations.
>> (aka: Release files refers to an older/newer Packages/Sources/… file)
> Currently the translation breakage, though the latest ftpsync I released
> yesterday should fix this a bit for the user experience.

Thanks! This probably reduces the hit rate to a bearable amount again,
but I fear at least one user will keep hitting it [0] ;)
(I really don't know how he does that…)

[0] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576184

>> To a certain point these problems can be avoided by deploying more and
>> more complex mirrorsync scripts with more or less atomic behavior.
> Yah, we need a better dists/ and not even more logic in ftpsync.
>> This doesn't help at all if we talk about round-robins though…
> Oh sure, just use staged pushing from ftpsync. :)

What I meant was two mirrors behind the same domain with different
update times. At any given moment, one of the two will have
a file earlier than the other.

> - Only *ONE* compression for anything in dists/.
>  To switch to another compression later, a second may be added for the
>  next release, and as soon as that release is out, the old one goes
>  away.[fn:1]

FTR: In theory you can already do this with apt,
except for the pdiffs, which are hardcoded as gzip files.
(the indexes do not mention a compression type)

> - Only one "release" file, drop away the old Release and Release.gpg.
>  Would anything break right now if I would drop the Release/Release.gpg
>  away for >>squeeze(-*)?

Again, in theory you can do this with apt, as the version in wheezy supports
InRelease, uses it if available and only falls back to Release{,.gpg} if not.
(as Debian's is basically the only archive with InRelease so far)
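
The fallback order described above can be sketched roughly as follows. This
is only an illustration of the described behaviour, not apt's actual code;
the function name pick_release_files and the idea of passing in a list of
available file names are assumptions made for the example.

```python
def pick_release_files(available):
    """Return the release files a client would fetch from a suite directory.

    Prefers the inline-signed InRelease and only falls back to the
    Release + Release.gpg pair if InRelease is not offered.
    """
    names = set(available)
    if "InRelease" in names:
        return ["InRelease"]
    if {"Release", "Release.gpg"} <= names:
        return ["Release", "Release.gpg"]
    raise FileNotFoundError("no signed release file available")


if __name__ == "__main__":
    # An archive that ships both variants: only InRelease is used.
    print(pick_release_files(["InRelease", "Release", "Release.gpg"]))
    # An archive without InRelease: fall back to the old pair.
    print(pick_release_files(["Release", "Release.gpg"]))
```

An archive that dropped Release{,.gpg} would still satisfy the first branch,
while a client that only knows the second branch would fail.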

> - Do we need an extra Release file per binary-$arch, saying nothing than
>  what we already know from the location of the directory?

As Julian noted in his draft mail apt doesn't use them.
I don't know which tools might use them…
They at least don't look that useful.

> - Get anything thats not "an index" out of dists/ and keep it out. The
>  installer is already on it, I started that thread before replying
>  here, so that gets out. We should nail it down that nothing else will
>  come in here in future, unless it's an index stuff.

Only related, but while reordering:
As hinted in the first mail, Contents should be in

The only change is that e.g. apt-file search /non-free/firmware/file
will not give a result anymore on systems without non-free.
(which is a feature or a "bug" mostly depending on how libre you are).

> - Hey, if we are at it, wth binary-$arch, lets rename to $arch only.

That might be a painful transition - at least apt hardcodes the path.
I guess it is not worth the effort; besides, there are some benefits to
having architectures easily separated from sources, i18n and such.

> - Saner diffs. Now that one is a "fun" one, I know, but having something
>  where you don't need to jump through dozens of very small files to end
>  up with the final result, but have one and out comes the result, for
>  example would be one thing. The sheer number of small .diffs makes it
>  unusable as soon as you have large bandwidth. It would be nice(r), i
>  think, if we could have something that lets you go from "x days ago to
>  $now".

(what follows is basically the short version of Goswin's reply, as I hadn't
 seen his response while writing this one)

There are old threads about that. apt already supports "skipping"
patches, so if you order the indexes correctly you can
already do that today (and I think reprepro supports creating this).
But this doesn't remove small diffs; it adds more of them, as you need
to provide for each mirrorsync a way to move to the days-skip diff.
(at least if very short paths are desired)

It might be better to tell apt to download all diffs in a row and merge
them itself instead of downloading and applying each individually.
(There is an old proof of concept for that, too. I just can't find it now.)

> - One rsync run ought to be enough to mirror all of Debian (or any
>  derivate using similar structure). Not X, with various
>  include/excludes.

Yeah, but how? I don't see a scenario in which updating InRelease
too early is a recoverable situation (or at least a situation in which
we don't download data we later can't validate).

>> Option A is that each mirror (if it chooses to do it) builds a big "index" of
>> hashsum-named hardlinks to the "old" location of the file. Given a
>> repository like this:
> I am against doing stuff outside the archive. We should have something
> that we say "this is it. mirror it. be done.". Not "this is it. mirror
> it. now do process XY".

This was a misunderstanding on my side. What was meant is that this
by-hash thingy is actually created by the master archive and the mirrors
just sync it -- but with the benefit that the mirrors don't need to sync it,
as they could create it themselves and can therefore also adopt this
option before the master archive does.
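
A minimal sketch of how a mirror could create such hashsum-named hardlinks
itself. The layout by-hash/SHA256/<digest> next to each index, the choice of
SHA256, the function name build_by_hash and the set of index names are all
assumptions made for illustration; nothing here is an agreed format.

```python
import hashlib
import os
from pathlib import Path


def build_by_hash(dists_dir, index_names=("Packages", "Sources")):
    """Hardlink each index file to by-hash/SHA256/<digest> in its directory.

    Returns the list of newly created links. The original file stays at
    its old location, so old clients keep working.
    """
    created = []
    # sorted() materializes the listing before any by-hash dirs are created.
    for path in sorted(Path(dists_dir).rglob("*")):
        if not path.is_file() or "by-hash" in path.parts:
            continue
        # Match e.g. Packages, Packages.gz, Sources.bz2 (assumed naming).
        if path.name.split(".")[0] not in index_names:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        target = path.parent / "by-hash" / "SHA256" / digest
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            os.link(path, target)  # hardlink: no extra disk space needed
            created.append(target)
    return created


if __name__ == "__main__":
    import sys
    for link in build_by_hash(sys.argv[1]):
        print(link)
```

Because a link that already exists is left alone, running this after every
sync keeps the hashsum-named copy of an older index reachable even after the
file at the old location has been replaced. Expiring old links is not covered
here.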

Scott Moser promised he would write a mail to correct my error
here and to defend the idea in general. We will see when this will happen…

>> Option B would be to introduce "versioned" components.
> *hate*, sorry. Thats just too ugly IMO.
> Though a variation of that, doing it with "versioned" suites, I would
> hate less. But still not nice.
> Right now my tendency would go to a hash based tree for the indices
> combined with hardlinks for the old tools and also the users.

:) Fair enough. I realized that B hard-depends on mirrors adopting
"clever" ftpsync scripts, which looked like a nice illusion for a while.

B was my on-the-spot invention to tackle problems I had with A.
First, that the client doesn't know if the mirror will support it or not.
Second, that a hash carries no useful information, but might be a
nightmare to get rid of if we want to transition to another hash.
(Besides, third, that I don't like the idea of implementing something
 which might or might not be adopted by any or only some if it is
 such a dramatic change - adding fuel to First)

I had intermediate ideas about versioning the individual indexes and
including them as usual in the InRelease file, but discarded them quickly
under the silly idea that stuff I don't know a thing about (mirrors) might be
easier to change than apt and all other clients.

Adding multiple versions to the InRelease file increases its size quite
a bit, up to a point where compression would become a topic, but this couldn't
be done in a sane way, hence a global version tag. And as it seemed to
be easier to version just a directory (e.g. the component) than all files
(as this includes a multitude of compatible links), B was born.

>> Disclaimer: This topic was part of discussions at the ubuntu's developer
>> summit. You can find a collection of notes by various people at [0], but
>> this requires you to have a launchpad account and join the ubuntu-etherpad
>> group [1] *sign* (there is properly a reason which is just behind my
>> understanding, so i am not going to copy the content to a pastebin, sry…)
>> so you might as well just trust me that I have summarized it
>> correctly.
> Right, that wont happen, so that thing basically doesn't exist. :)

For the record: recordings are available (pun intended).
I haven't tested them yet, but usually they are okayish.


(^^ the first few minutes of the second session are probably nonsense
 as the required people needed a few minutes to come by)

Best regards

David Kalnischkies

P.S.: I am subscribed to dak@ as well as deity@, so no need for a cc.
