
Re: Hashsum mismatch prevention strategies

On 12843 March 1977, David Kalnischkies wrote:

> Archive updates are happening faster and faster nowadays - debian is at
> 6 hours, ubuntu at times at 1 hour

I'm pretty sure Debian won't go below 6 hours any time soon, and even if
we did, I don't think hourly is a sane thing to do for an
archive like ours. Anyway, different topic.

>- users and scripts alike are presented
> more and more often with messages from their package managers claiming
> that the metadata doesn't match their expectations.
> (aka: Release files refers to an older/newer Packages/Sources/… file)

Currently that is the translation breakage, though the latest ftpsync I
released yesterday should improve the user experience a bit.

> To a certain point these problems can be avoided by deploying more and
> more complex mirrorsync scripts with more or less atomic behavior.

Yah, we need a better dists/ and not even more logic in ftpsync.

> This doesn't help at all if we talk about round-robins though…

Oh sure, just use staged pushing from ftpsync. :)

> Basically I/we identified two options to come across these limits:

I am not sure I like the options we have right now, but we need
something that not only makes the indices updates (as much as possible)
atomic, but keeps history for at least the previous run, yes.

Though I think before we get there we should maybe go a step further
back and come up with an agreed-on "standard" for dists/. What we
currently have is largely "oh, let's see, it works, let's do it", with
every one of us randomly adding to the things we have in there. And/or
changing the files themselves. I know that, I did that myself. :)

And I know that Julian already started working on something for that
using the wiki(.d.o). Right now I just have a few wishlist points:

- Only *ONE* compression for anything in dists/.[fn:1]
  To switch to another compression later, a second may be added for the
  next release, and as soon as that release is out, the old one goes.
  The current situation, with the information needlessly doubled for
  years already, just sucks.

- Only one "release" file, drop the old Release and Release.gpg.
  Would anything break right now if I dropped Release/Release.gpg
  for >>squeeze(-*)?

- Do we need an extra Release file per binary-$arch, saying nothing other
  than what we already know from the location of the directory?

- Get anything that's not "an index" out of dists/ and keep it out. The
  installer is already on it, I started that thread before replying
  here, so that gets out. We should nail it down that nothing else will
  come in here in future, unless it is an index.

- Hey, while we are at it, wth binary-$arch, let's rename it to just $arch.

- Saner diffs. Now that one is a "fun" one, I know, but one thing would
  be having something where you don't need to jump through dozens of
  very small files to end up with the final result, but fetch one file
  and out comes the result. The sheer number of small .diffs makes it
  unusable as soon as you have large bandwidth. It would be nice(r), I
  think, if we could have something that lets you go from "x days ago to

- One rsync run ought to be enough to mirror all of Debian (or any
  derivative using a similar structure). Not X, with various
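
To put rough numbers on the "saner diffs" point above, here is a
back-of-the-envelope sketch. It assumes the client fetches files one
after another and pays one round trip per file (no pipelining); all
sizes, latencies and bandwidths are made-up illustrative values, not
measurements from any mirror.

```python
def fetch_seconds(n_files, total_bytes, rtt_s, bandwidth_bps):
    """Rough fetch time: one round trip per file plus the raw transfer time.

    Assumes strictly sequential requests; real clients may pipeline.
    """
    return n_files * rtt_s + total_bytes * 8 / bandwidth_bps


# Hypothetical workload: 50 small diffs of 20 kB each vs. one 8 MB full index.
DIFFS = dict(n_files=50, total_bytes=50 * 20_000)
FULL = dict(n_files=1, total_bytes=8_000_000)

for label, bps in (("100 Mbit/s", 100e6), ("1 Mbit/s", 1e6)):
    diffs = fetch_seconds(rtt_s=0.1, bandwidth_bps=bps, **DIFFS)
    full = fetch_seconds(rtt_s=0.1, bandwidth_bps=bps, **FULL)
    print(f"{label}: diffs {diffs:.2f}s, full index {full:.2f}s")
```

Under these assumptions the per-file round trips dominate on the fast
link, so the chain of small diffs loses to simply fetching the full
file, while on the slow link the diffs still win.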

> Option A is that each mirror (if it chooses to do it) builds a big "index" of
> hashsum-named hardlinks to the "old" location of the file. Given a
> repository like this:

I am against doing stuff outside the archive. We should have something
that we say "this is it. mirror it. be done.". Not "this is it. mirror
it. now do process XY".

If we can, the ideal solution is one that lets us end up with a mirror
script that has to run rsync once. No matter that dists/ comes before
pool/. So the mirror script would be reduced to a small thing doing
rsync+tracefiles basically.
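
A minimal sketch of what such a reduced "rsync+tracefiles" script could
look like. The function names, the trace file content and the
project/trace/<hostname> location are assumptions for illustration, not
what ftpsync does, and a single pass like this is only safe once the
layout of dists/ makes the transfer order irrelevant.

```python
import os
import socket
import subprocess
import time


def rsync_command(source, target):
    """Build one rsync invocation that mirrors the whole tree in a single pass."""
    return ["rsync", "-a", "--hard-links", "--delay-updates",
            "--delete-after", source, target]


def write_tracefile(target, hostname=None, now=None):
    """Record when this mirror last finished a run (assumed trace layout)."""
    hostname = hostname or socket.getfqdn()
    trace_dir = os.path.join(target, "project", "trace")
    os.makedirs(trace_dir, exist_ok=True)
    path = os.path.join(trace_dir, hostname)
    stamp = time.strftime("%a %b %d %H:%M:%S UTC %Y", time.gmtime(now))
    with open(path, "w") as fh:
        fh.write(stamp + "\n")
    return path


def mirror_once(source, target):
    """Run rsync exactly once, then drop the tracefile."""
    subprocess.run(rsync_command(source, target), check=True)
    return write_tracefile(target)
```

A hypothetical call would be
mirror_once("rsync://upstream.example.org/debian/", "/srv/mirror/debian/").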

> Option B would be to introduce "versioned" components.

*hate*, sorry. That's just too ugly IMO.

Though a variation of that, doing it with "versioned" suites, I would
hate less. But still not nice.

Right now my tendency would go to a hash-based tree for the indices
combined with hardlinks for the old tools and also the users.
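
To make that a bit more concrete, a rough sketch of the idea, done
archive-side so mirrors just copy the result. The by-hash/<ALGO>/
directory name and the function name are assumptions made up for this
example, nothing that has been agreed on.

```python
import hashlib
import os


def link_by_hash(dists_root, index_relpaths, algo="sha256"):
    """Hardlink each index file to <its dir>/by-hash/<ALGO>/<hexdigest>.

    The file stays at its old location for old tools and users; the
    hash-named link is a second name for the same content.
    """
    created = {}
    for rel in index_relpaths:
        src = os.path.join(dists_root, rel)
        digest = hashlib.new(algo)
        with open(src, "rb") as fh:
            # Hash in chunks so large index files are not read into memory at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        hash_dir = os.path.join(os.path.dirname(src), "by-hash", algo.upper())
        os.makedirs(hash_dir, exist_ok=True)
        dst = os.path.join(hash_dir, digest.hexdigest())
        if not os.path.exists(dst):
            os.link(src, dst)  # hardlink, costs no extra space
        created[rel] = dst
    return created
```

If the next run replaces an index by writing a new file and renaming it
over the old name, the hash-named link from the previous run still
points at the old content, which gives the "history for at least the
previous run" until someone cleans it up.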

> So, in short: What do you think? Is there an option C or are there
> features/problems in A or B which i have omitted/overseen?

As you see I don't have a written-out C right now.

> Disclaimer: This topic was part of discussions at the ubuntu's developer
> summit. You can find a collection of notes by various people at [0], but
> this requires you to have a launchpad account and join the ubuntu-etherpad
> group [1] *sigh* (there is probably a reason which is just beyond my
> understanding, so I am not going to copy the content to a pastebin, sry…)
> so you might as well just trust me that I have summarized it
> correctly.

Right, that won't happen, so that thing basically doesn't exist. :)


[fn:1] Right now I would say we keep gzip, kill off bzip2, and add
       xz. gzip is killed the day after wheezy is released, xz stays
       alone. Now, that's from a user POV, from a mirror POV we go with
       gzip and never consider any other compression in dists/, as
       --rsyncable is THEWIN, kthxbye.

[fn:2] If they are not, they are f*cked already. We have numerous
       symlinks in our tree AND we are running tools over the archive,
       every dinstall, which hardlink identical files to each other. If
       you don't support that, your mirror is at least bloated to no end
       right now anyways.

bye, Joerg
(Somewhere on heise.de):
Jesus was a typical student:
- Lived with his parents until he was 30, - Had long hair
- Whenever he actually did something, it was a miracle
