Re: Hashsum mismatch prevention strategies
David Kalnischkies <firstname.lastname@example.org> writes:
> Hello everyone,
> (everyone who wonders: we¹ decided² that email@example.com might
> be the right place to discuss repository format changes and as this
> might or might not be one let's start it there. cc'ed deity@ just for info)
> Over the course of a few years the idea of making the archive and especially
> the clients more robust against the non-atomic distribution of indexes across
> the whole mirror network is an evergreen without a defined solution nor
> action plan so far.
> Archive updates are happening faster and faster nowadays - debian is at
> 6 hours, ubuntu at times at 1 hour - users and scripts alike are presented
> more and more often with messages from their package managers claiming
> that the metadata doesn't match their expectations.
> (aka: the Release file refers to an older/newer Packages/Sources/… file)
> To a certain point these problems can be avoided by deploying more and
> more complex mirrorsync scripts with more or less atomic behavior.
> This doesn't help at all if we talk about round-robins though.
> Basically I/we identified two options to overcome these limits:
> Either we change the repository format, or we change how the
> clients request the files from the repository (both actually happen in
> both options, but the main focus is divided between the two)
> Option A is that each mirror (if it chooses to do it) builds a big "index" of
"if it chooses to do it" is already a death sentence for this
idea. Official mirrors aren't even using the simple 2 stage rsync script
debian provides for near atomic updates. You won't get them to build
Any solution needs to be something Debian does and mirrors just copy
> hashsum-named hardlinks to the "old" location of the file. Given a
> repository like this:
> this would mean that we e.g. have a
> unstable/by-hash/sha256/bbbbb… -> unstable/main/binary-amd64/Packages
> unstable/by-hash/sha256/ccccc… -> unstable/main/binary-i386/Packages
> (Imagine this being done for e.g. md5 and sha1 hashes, too)
> A client like apt would then request the InRelease file as usual and then take
> the hashes it can extract out of it to request the other files it needs.
> On a mirrorsync the index files will be updated and get new hashes,
> but a new client still working with the old InRelease file will still get
> the old index files based on their hash. Old clients will run into the
> same problem we have now as nothing has changed from an archive creation
> point of view.
What if the mirror has the new InRelease file but not the hashed files
> As it is the mirror that generates the by-hash links, it would be free not
> to do so and/or to store old indexes for a self-chosen length of time. Given
> that, a client needs to fall back, for every file it can't get by-hash, to
> requesting it by its "old" location -- and in the long run it has to check
> for different checksums as we move to stronger hashes over time.
> (To make it a tiny bit more user friendly we could let the filenames be
> Packages.$hash and alike as well as having by-hash in different places,
> but this doesn't change the idea itself so i omitted these sub-options
> and discuss instead after an option is chosen)
> Option B would be to introduce "versioned" components. The InRelease file
> would include a tag specifying a version (a good version would e.g. be
> the date(time) of the creation) for the components it includes.
> A client would then not request files under $component but under
> $component-$version, e.g. instead of main it would be main-2012-05-10.
> An old client would "just" follow the link from main to its current version
> off-spin similar to how unstable links to sid. As the InRelease includes
> a new tag a new client will need to use this "feature" we don't need a
Not all index files change on every pulse leading to large scale
duplication of files. Also, since mirrors won't be using any special
debian mirror scripts, this will download them all again and again every
time the version changes.
> On mirrorsync new versioned components end up on the mirror, making
> the update for old clients close to atomic (flip component links,
> new InRelease) and atomic for new clients (new InRelease).
> It's up to the mirror how long it wants to store "old" components.
> Nothing special needs to be done by the mirror, although a clever
> sync-script might improve the "experience" with hardlinks additionally
> to our usual two-stage updating in case of unchanged files in a
> component from one version to the other.
> Bonus: As this versioned one is a complete component it would be
> possible to use it in sources.list (or equivalents) for debugging purposes
> (untrusted - or with a bit more code keeping old InRelease files around
> and download these if such a component is encountered).
Also again what about InRelease being new with the rest still missing?
> Opinion ahead: I tried to be rather neutral up to this point, but i have to
> admit that i might have failed at that as I currently prefer B mainly
> because of not needing fallbacks and having a decent benefit even without
> a new client talking to it, while A (even if decentralization is nice) feels
> more like trying to bash a new protocol through an "old" protocol by
> requesting hashes with filenames because we have no protocol supporting
> requests by multiple hashes instead.
> (jftr: Neither is that an impression carved into stone nor is it unopposed)
> So, in short: What do you think? Is there an option C or are there
> features/problems in A or B which i have omitted/overlooked?
> Disclaimer: This topic was part of discussions at ubuntu's developer
> summit. You can find a collection of notes by various people at [1], but
> this requires you to have a launchpad account and join the ubuntu-etherpad
> group [2] *sigh* (there is probably a reason which is just beyond my
> understanding, so i am not going to copy the content to a pastebin, sry?)
> so you might as well just trust me that I have summarized it correctly.
> At [3] you can find a longer story of why we need to improve with a
> proposed solution similar to B just with versioning at the individual
> file level.
> Best regards
> David Kalnischkies
> [1] http://pad.ubuntu.com/uds-q-servercloud-q-apt-improvements
> [2] https://launchpad.net/~ubuntu-etherpad
> [3] https://bugs.launchpad.net/ubuntu/+source/apt/+bug/972077
> ¹ "I" ² "was told" in
> P.S.: I am aware that B doesn't cover Contents files currently, but frankly,
> these files should be in $(component)/binary-$(arch)/Contents for all kinds
> of reasons but that is a different topic and not part of the problem in so
> far as that apt-file as far as i know doesn't check the hashes?
> (which we hopefully can fix with apt's gsoc project, but that's yet another
> different topic?)
I think you are missing part of the problem and thus your solution can't
work. You seem to assume the InRelease file is the last to be updated
while it just as well might be the first or somewhere in between.
A proper solution has to work with the index files being updated in
pretty much any order. Secondly the mirror space and bandwidth should
not increase too much. As such I much prefer a solution with hashes over
one with versions, unless they are per file.
And there is one fact apt is not utilizing: We have mirrors, meaning
multiple sources for the same information. We should use untrusted
information from mirror A if we can establish a trust path for it from
mirror B. So if we can't verify Packages.gz from mirror A but mirror B
has a correct InRelease file listing the checksum for Packages.gz from
mirror A then the Packages.gz file from mirror A should become trusted.
In fact Packages.gz from mirror B should not even be downloaded if we
already got a matching one from mirror A. Same for Sources.gz and even
individual debs (any file listed in an index). If we can establish a
trust path for a file on mirror A by way of mirror B then the file on
mirror A should be used. Thereby an orig.tar.ext file could be
downloaded from A even if diff+dsc are downloaded from B. In general
files should be uniquely identified by their hash and not by their url.
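To illustrate what I mean by identifying files by hash, here is a
minimal sketch. It assumes the client keeps everything it downloads in a
store keyed by SHA256 sum; the class name HashStore and its methods are
made up for this mail and are not anything apt provides today.

```python
import hashlib


class HashStore:
    """Keeps downloaded files keyed by their SHA256 sum, not by URL."""

    def __init__(self):
        self._by_hash = {}

    def add(self, data):
        # Store the bytes under their own digest; the source mirror is irrelevant.
        digest = hashlib.sha256(data).hexdigest()
        self._by_hash[digest] = data
        return digest

    def lookup(self, digest):
        """Return the bytes for a digest, or None if a download is still needed."""
        return self._by_hash.get(digest)


store = HashStore()

# Unverified download from mirror A (its own InRelease did not list this sum).
from_a = b"Package: hello\nVersion: 2.8-1\n"
store.add(from_a)

# Digest listed in mirror B's verified InRelease (computed here for the demo).
trusted_digest = hashlib.sha256(from_a).hexdigest()

# The copy from mirror A becomes trusted; nothing is fetched from mirror B.
assert store.lookup(trusted_digest) == from_a
assert store.lookup("0" * 64) is None
```

The point of the design is that trust comes only from the signed
checksum list, so it does not matter which mirror delivered the bytes.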
That said there still remains the problem of individual mirrors having
out-of-sync index files. Not everyone has multiple mirrors in their
sources.list. For that we have to consider 2 cases:
1) InRelease is older than Packages
To be able to use the outdated checksum the old file must remain
This can be achieved by a variant of option A above. But do this on
ftp-master. Personally I would prefer Packages.<hash> over
unstable/by-hash/sha256/bbbbb. Packages should be a link to the latest
for compatibility. ftp-master should retain the current, last and
second last of all files listed in InRelease. (Why? see below)
2) InRelease is newer than Packages
To fix this we need the older checksum so the older file can still be
This requires a change in the InRelease file. A simple solution would be
to have a '<sum>-old' field for every '<sum>' field we have listing the
checksum, size and name of the files from the last mirror run.
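A rough sketch of how a client could handle both cases, assuming the
proposed (not yet existing) 'SHA256-old' field uses the same layout as
'SHA256' and that files are published as <name>.<hash>. The function
names are hypothetical and only for illustration.

```python
import hashlib


def parse_sums(release_text):
    """Return {field: {name: (checksum, size)}} for multi-line checksum fields."""
    fields = {}
    current = None
    for line in release_text.splitlines():
        if line.startswith(" ") and current is not None:
            checksum, size, name = line.split()
            fields[current][name] = (checksum, int(size))
        elif line.endswith(":"):
            current = line[:-1]          # start of a multi-line field
            fields[current] = {}
        else:
            current = None               # ordinary "Key: value" line
    return fields


def candidates(name, fields):
    """Hashed names to request, newest first, then the plain name as fallback."""
    names = []
    for field in ("SHA256", "SHA256-old"):
        entry = fields.get(field, {}).get(name)
        if entry is not None:
            names.append("%s.%s" % (name, entry[0]))
    return names + [name]


def matches(name, data, fields):
    """Return the field that vouches for data, or None if neither does."""
    digest = hashlib.sha256(data).hexdigest()
    for field in ("SHA256", "SHA256-old"):
        if fields.get(field, {}).get(name) == (digest, len(data)):
            return field
    return None


old = b"Package: hello\nVersion: 2.7-1\n"
new = b"Package: hello\nVersion: 2.8-1\n"
name = "main/binary-amd64/Packages"
release = "Suite: unstable\nSHA256:\n %s %d %s\nSHA256-old:\n %s %d %s\n" % (
    hashlib.sha256(new).hexdigest(), len(new), name,
    hashlib.sha256(old).hexdigest(), len(old), name,
)
fields = parse_sums(release)

assert matches(name, new, fields) == "SHA256"        # mirror fully updated
assert matches(name, old, fields) == "SHA256-old"    # InRelease newer than Packages
assert matches(name, b"garbage", fields) is None
```

With this a client that got a new InRelease but an old Packages can
still verify the old file, and candidates() gives the order in which to
try the hashed names before falling back to the plain one.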
Combine the two and even during a mirror pulse a mirror will always have
a full set of InRelease, Packages, Sources, Contents, ... files that is
consistent. The client will be able to verify the current or last set of
files and the mirror will have the second last set of files ensuring that
they don't delete files before users can no longer get an InRelease file
Worst case (sid) this will triple the space required for index
files. But especially for stable the files rarely change. One could
still do a mirror pulse making the last and second last set of index files
in InRelease be identical to the current one and therefore freeing up the