Hashsum mismatch prevention strategies
(For everyone who wonders: we¹ decided² that email@example.com might
be the right place to discuss repository format changes, and as this
might or might not be one, let's start it there. CC'ed deity@ just for info.)
For a few years now, the idea of making the archive, and especially the
clients, more robust against the non-atomic distribution of indexes across
the whole mirror network has been an evergreen topic, so far without a
defined solution or action plan.
Archive updates are happening faster and faster nowadays (Debian is at
6 hours, Ubuntu at times at 1 hour), so users and scripts alike are presented
more and more often with messages from their package managers claiming
that the metadata doesn't match their expectations
(aka: the Release file refers to an older/newer Packages/Sources/… file).
Up to a certain point these problems can be avoided by deploying more and
more complex mirrorsync scripts with more or less atomic behavior.
This doesn't help at all if we talk about round-robins though, as
consecutive requests can end up on mirrors in different sync states…
Basically I/we identified two options to overcome these limits:
Either we change the repository format, or we change how the clients
request the files from the repository (both actually happen in
both options, but the main focus differs between the two).
Option A is that each mirror (if it chooses to do it) builds a big "index" of
hashsum-named hardlinks to the "old" location of the file. Given a
repository like this:
this would mean that we have, e.g.:
unstable/by-hash/sha256/bbbbb… -> unstable/main/binary-amd64/Packages
unstable/by-hash/sha256/ccccc… -> unstable/main/binary-i386/Packages
(Imagine this being done for e.g. md5 and sha1 hashes, too)
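The mirror-side part of option A could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name build_by_hash and its parameters are made up for this mail, and it assumes that a mirrorsync replaces index files by renaming new files into place, so that an existing hardlink keeps pointing at the old content.

```python
import hashlib
import os
from pathlib import Path


def build_by_hash(suite_dir, index_files, algo="sha256"):
    """Hardlink each index file to <suite_dir>/by-hash/<algo>/<hexdigest>.

    suite_dir and index_files (paths relative to suite_dir) are
    hypothetical parameters chosen for this sketch.
    """
    suite_dir = Path(suite_dir)
    target_dir = suite_dir / "by-hash" / algo
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for rel in index_files:
        source = suite_dir / rel
        digest = hashlib.new(algo, source.read_bytes()).hexdigest()
        link = target_dir / digest
        # A link with this name already has exactly this content,
        # so there is nothing to do if it exists.
        if not link.exists():
            os.link(source, link)
        links.append(link)
    return links
```

Running it once per hash algorithm (md5, sha1, sha256) would give the layout shown above; how long old links are kept is left to the mirror.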
A client like apt would then request the InRelease file as usual and then take
the hashes it can extract out of it to request the other files it needs.
On a mirrorsync the index files will be updated and get new hashes,
but a new client still working with the old InRelease file will still get
the old index files based on their hash. Old clients will run into the
same problem we have now, as nothing has changed for them from an archive
creation point of view.
As it is the mirror that generates by-hash, it would be free to not do it
and/or to store old indexes for a self-chosen length of time. Given that, a
client needs to fall back, for every file it can't get by-hash, to requesting
it by its "old" location, and in the long run it has to check for different
checksums as we move to stronger hashes over time.
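The client-side fallback described above could look roughly like this. It is a sketch under assumptions, not apt's actual behavior: fetch is a hypothetical callable returning the file content as bytes or None if the mirror doesn't have the path, hashes is a dict of the checksums taken from the InRelease file, and the preference order is my own choice.

```python
import hashlib

# Strongest hash first; the names are hashlib algorithm names (assumption).
PREFERENCE = ("sha256", "sha1", "md5")


def candidate_paths(suite, index_path, hashes):
    """By-hash locations for every known hash, then the "old" location."""
    paths = [f"{suite}/by-hash/{algo}/{hashes[algo]}"
             for algo in PREFERENCE if algo in hashes]
    paths.append(f"{suite}/{index_path}")
    return paths


def matches(data, hashes):
    """Verify data against the strongest checksum we know about."""
    for algo in PREFERENCE:
        if algo in hashes:
            return hashlib.new(algo, data).hexdigest() == hashes[algo]
    return False


def fetch_index(fetch, suite, index_path, hashes):
    """Return (path, data) for the first candidate that verifies, else None."""
    for path in candidate_paths(suite, index_path, hashes):
        data = fetch(path)
        if data is not None and matches(data, hashes):
            return path, data
    return None
```

Note that the fallback to the "old" location can still produce a mismatch if the mirror has already moved on, which is exactly the situation we have today.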
(To make it a tiny bit more user friendly we could let the filenames be
Packages.$hash and alike as well as having by-hash in different places,
but this doesn't change the idea itself, so I omitted these sub-options;
we can discuss them after an option is chosen.)
Option B would be to introduce "versioned" components. The InRelease file
would include a tag specifying a version (a good version would e.g. be
the date(time) of the creation) for the components it includes.
A client would then not request files under $component but under
$component-$version, e.g. instead of main it would be main-2012-05-10.
An old client would "just" follow the link from main to its current
versioned counterpart, similar to how unstable links to sid. As the InRelease includes
a new tag a new client will need to use this "feature" we don't need a
On mirrorsync new versioned components end up on the mirror, making
the update for old clients close to atomic (flip component links,
new InRelease) and atomic for new clients (new InRelease).
It's up to the mirror how long it wants to store "old" components.
Nothing special needs to be done by the mirror, although a clever
sync script might improve the "experience" by using hardlinks, in addition
to our usual two-stage updating, for files that are unchanged in a
component from one version to the next.
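The "flip component links" step for old clients could be done along these lines. Again only a sketch: flip_component_link is a made-up name, the temporary link name is arbitrary, and it assumes the plain component name (e.g. main) is already a symlink or doesn't exist yet, on a POSIX filesystem where rename is atomic.

```python
import os
from pathlib import Path


def flip_component_link(suite_dir, component, version):
    """Point <component> at <component>-<version> in one atomic step."""
    suite_dir = Path(suite_dir)
    versioned = f"{component}-{version}"   # e.g. main-2012-05-10
    if not (suite_dir / versioned).is_dir():
        raise FileNotFoundError(versioned)
    link = suite_dir / component
    tmp = suite_dir / f".{component}.tmp"  # hypothetical temporary name
    if tmp.is_symlink():
        tmp.unlink()                       # leftover from an aborted run
    os.symlink(versioned, tmp)             # relative target inside the suite
    os.replace(tmp, link)                  # rename swaps the link itself
    return link
```

A sync script would call this only after the new versioned component is completely on the mirror and then put the new InRelease in place, so old clients see either the old or the new component, never a mix within it.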
Bonus: As this versioned one is a complete component it would be
possible to use it in sources.list (or equivalents) for debugging purposes
(untrusted, or with a bit more code, by keeping old InRelease files around
and downloading these if such a component is encountered).
Opinion ahead: I tried to be rather neutral up to this point, but I have to
admit that I might have failed at that, as I currently prefer B, mainly
because of not needing fallbacks and having a decent benefit even without
a new client talking to it, while A (even if decentralization is nice) feels
more like trying to bash a new protocol through an "old" protocol by
requesting hashes as filenames because we have no protocol supporting
requests by multiple hashes instead.
(jftr: This impression is neither carved in stone nor unopposed.)
So, in short: What do you think? Is there an option C or are there
features/problems in A or B which I have omitted/overlooked?
Disclaimer: This topic was part of discussions at the Ubuntu Developer
Summit. You can find a collection of notes by various people at , but
this requires you to have a launchpad account and join the ubuntu-etherpad
group  *sigh* (there is probably a reason which is just beyond my
understanding, so I am not going to copy the content to a pastebin, sry…)
so you might as well just trust me that I have summarized it correctly.
At  you can find a longer story of why we need to improve with a
proposed solution similar to B just with versioning at the individual
¹ "I" ² "was told" in
P.S.: I am aware that B doesn't cover Contents files currently, but frankly,
these files should be in $(component)/binary-$(arch)/Contents for all kinds
of reasons, but that is a different topic and not part of the problem,
insofar as apt-file, as far as I know, doesn't check the hashes…
(which we hopefully can fix with apt's GSoC project, but that's yet another