
Re: Hashsum mismatch prevention strategies



On Sat, May 19, 2012 at 11:27:23PM +0200, Joerg Jaspert wrote:
> Though I think before we get there we maybe should go a step further
> back and come up with an agreed-on "standard" for dists/. What we
> currently have is largely "oh, let's see, it works, let's do it", with
> every one of us randomly increasing the things we have in there. And/or
> changing the files themselves. I know that, I did that myself. :)
> 
> 
> And I know that Julian already started working on something for that
> using the wiki(.d.o). Right now I just have a few wishlist points:
> 
> - Only *ONE* compression for anything in dists/.
>   To switch to another compression later, a second may be added for the
>   next release, and as soon as that release is out, the old one goes
>   away.[fn:1]
>   The current situation with needlessly doubling the information for
>   years already just sucks.

That should be easy for most cases. You can already drop bzip2 compression
for Packages and Sources if you want to, I don't think anyone really
cares about them. For Translations, it might be more difficult, as they
are currently only available in bzip2 and some might rely on this.

Once you have dropped the bzip2 compressed indices, you might already
have gained enough space to keep an older generation of the indices
around.

> 
> - Only one "release" file, drop away the old Release and Release.gpg.
>   Would anything break right now if I would drop the Release/Release.gpg
>   away for >>squeeze(-*)?

Cupt and Smart do not support InRelease files yet. Smart will probably
get support for them when Ubuntu introduces them, as Canonical is
involved in Smart development and uses Smart for their Landscape
stuff. For Cupt, see bug 623113. Perhaps more importantly, debootstrap
and cdebootstrap would both break as well.



> - Do we need an extra Release file per binary-$arch, saying nothing other
>   than what we already know from the location of the directory?

At least APT does not need it, and dselect does not appear to need
it either.

> 
> - Get anything that's not "an index" out of dists/ and keep it out. The
>   installer is already on it, I started that thread before replying
>   here, so that gets out. We should nail it down that nothing else will
>   come in here in future, unless it's index stuff.
> 
> - Hey, if we are at it, wth binary-$arch, let's rename to $arch only.

I don't think we need to deliberately introduce incompatibilities.


> 
> - Saner diffs. Now that one is a "fun" one, I know, but having something
>   where you don't need to jump through dozens of very small files to end
>   up with the final result, but have one and out comes the result, for
>   example would be one thing. The sheer number of small .diffs makes it
>   unusable as soon as you have large bandwidth. It would be nice(r), I
>   think, if we could have something that lets you go from "x days ago to
>   $now".

You can do this without any format change, just by changing the
algorithm. reprepro already creates diffs from past to current, instead
of incremental ones.
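
To illustrate the difference, here is a minimal Python sketch that builds
one patch per older snapshot, each leading straight to the current state,
so a client applies exactly one patch however far behind it is. This is
not reprepro's code and not the real pdiff (ed-style) format; the names
make_patch, apply_patch and patches_to_current are made up for this
example, and the patch representation is a simplification.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Return the edits that turn the line list `old` into `new`.

    Each edit is (start, end, replacement_lines) with indices into `old`.
    """
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_patch(old, patch):
    """Apply a patch from make_patch() to `old` and return the new lines."""
    result = list(old)
    # Work from the end so that earlier indices stay valid.
    for start, end, lines in reversed(patch):
        result[start:end] = lines
    return result


def patches_to_current(history):
    """One patch per older snapshot, each going directly to the newest one."""
    current = history[-1]
    return [make_patch(snapshot, current) for snapshot in history[:-1]]


# Three generations of a (tiny, hypothetical) Packages index.
history = [
    ["Package: foo", "Version: 1.0"],
    ["Package: foo", "Version: 1.1"],
    ["Package: foo", "Version: 1.2", "Package: bar", "Version: 0.1"],
]
patches = patches_to_current(history)
# A client still on the oldest snapshot needs a single patch.
updated = apply_patch(history[0], patches[0])
```

The cost moves to the archive side: every time the index changes, all
patches have to be regenerated against the new current state.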

> 
> - One rsync run ought to be enough to mirror all of Debian (or any
>   derivative using similar structure). Not X, with various
>   include/excludes.
> 
> > Option A is that each mirror (if it chooses to do it) builds a big "index" of
> > hashsum-named hardlinks to the "old" location of the file. Given a
> > repository like this:
> 
> I am against doing stuff outside the archive. We should have something
> that we say "this is it. mirror it. be done.". Not "this is it. mirror
> it. now do process XY".

You can also do this in the archive, it does not really have to be
done on the mirror.

> 
> If we can, the ideal solution is one that lets us end up with a mirror
> script that has to run rsync once. No matter that dists/ comes before
> pool/. So the mirror script would be reduced to a small thing doing
> rsync+tracefiles basically.
> 
> > Option B would be to introduce "versioned" components.
> 
> *hate*, sorry. That's just too ugly IMO.
> 
> Though a variation of that, doing it with "versioned" suites, I would
> hate less. But still not nice.
> 
> Right now my tendency would go to a hash based tree for the indices
> combined with hardlinks for the old tools and also the users.
> 
> > So, in short: What do you think? Is there an option C or are there
> > features/problems in A or B which I have omitted/overlooked?
> 
> As you see I don't have a written-out C right now.

There is a third option in the Ubuntu pad, that is basically suffixing
the indices with the hash, let's say

	Packages-d86236a0c540b340986c99e94d0d9159c66b96a34adc2de01e2668f2d3a2ded2.gz

(using SHA256 here). This approach is relatively easy in my opinion. We can
then also add a field to the Release file saying (just for optimisation
purposes, instead of blindly trying the hashes):

	Indices-Hashing: sha256

(although we could just as well look at e.g. the SHA256 section and see
whether the hashed file name is listed there), and we are basically done.
We just keep two of those files around, one for the past state and one
for the current state. It also seems closely related to the stuff RPM
people do.
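
A rough sketch of the naming scheme in Python, to make it concrete. It
assumes the hash is taken over the file as served (i.e. the compressed
data), which the proposal above does not actually pin down; the helper
names hashed_index_name and prune_generations are invented for this
example.

```python
import hashlib


def hashed_index_name(stem, data, suffix=".gz"):
    """Return e.g. 'Packages-<sha256 hex digest>.gz' for the given content."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{stem}-{digest}{suffix}"


def prune_generations(names_oldest_first, keep=2):
    """Keep only the newest `keep` hashed files (past and current state)."""
    return names_oldest_first[-keep:]


# Stand-ins for three successive generations of a compressed index.
generations = [b"index generation 1", b"index generation 2", b"index generation 3"]
names = [hashed_index_name("Packages", blob) for blob in generations]
kept = prune_generations(names)
```

Since the name is derived from the content, a client holding a slightly
older Release file can still find the index that file refers to, as long
as that generation has not been pruned yet.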

Another option is to have a file Packages.old.gz, and APT then simply
fetches that one when it notices that Packages.gz does not match the
hash it expects. Should work as well.
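
The client side of that could look roughly like the Python sketch below.
It is not how APT is implemented; fetch_verified and the dict standing in
for a mirror are made up for illustration, and the expected hash is
assumed to come from the Release file the client already has.

```python
import hashlib


def fetch_verified(fetch, candidates, expected_sha256):
    """Try each candidate name in order and return (name, data) for the
    first one whose SHA256 matches, or None if none of them does.

    `fetch` is any callable mapping a file name to bytes (or None).
    """
    for name in candidates:
        data = fetch(name)
        if data is not None and hashlib.sha256(data).hexdigest() == expected_sha256:
            return name, data
    return None


# Hypothetical mirror caught mid-update: Packages.gz is already the new
# generation, but the client's Release file still describes the old one.
mirror = {
    "Packages.gz": b"new index content",
    "Packages.old.gz": b"old index content",
}
expected = hashlib.sha256(b"old index content").hexdigest()
result = fetch_verified(mirror.get, ["Packages.gz", "Packages.old.gz"], expected)
```

Here the first download fails verification and the fallback succeeds, so
result names Packages.old.gz.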

A further option is to do the same .old stuff, but for the whole
dists directory:

	cp -al dists dists.new
	rsync master:dists ... to  ... dists.new
	mv dists dists.old
	mv dists.new dists

And then fall back to dists.old if there is something wrong
in dists. This should be atomic enough for everyone, and easy
to implement.
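
For the mirror side, the rotation can be sketched in Python as follows.
This is only an illustration under a few assumptions: the rsync run is
replaced by a caller-supplied sync function, a dists.old left over from
the previous run is removed first (the commands above leave that step
out), and the sync step replaces changed files instead of writing into
them, since the hardlinked copies share their content with the live tree.
The names rotate_dists and replace_file are invented here.

```python
import os
import shutil
import tempfile


def replace_file(path, data):
    """Write `data` to a temporary file and rename it over `path`, so that
    other hardlinks to the old content stay untouched."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(data)
    os.replace(tmp, path)


def rotate_dists(root, sync):
    """Update root/dists via a hardlinked copy, keeping root/dists.old."""
    dists = os.path.join(root, "dists")
    new = os.path.join(root, "dists.new")
    old = os.path.join(root, "dists.old")

    # Equivalent of `cp -al dists dists.new`: a tree of hardlinks.
    shutil.copytree(dists, new, copy_function=os.link)
    # Stand-in for the rsync run; only dists.new is modified.
    sync(new)
    # Assumption: drop the fallback generation from the previous run.
    if os.path.exists(old):
        shutil.rmtree(old)
    # Between these two renames dists is briefly missing, which is where
    # falling back to dists.old comes in.
    os.rename(dists, old)
    os.rename(new, dists)


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "dists"))
    replace_file(os.path.join(root, "dists", "Release"), b"generation 1\n")
    rotate_dists(
        root,
        lambda tree: replace_file(os.path.join(tree, "Release"), b"generation 2\n"),
    )
    with open(os.path.join(root, "dists.old", "Release"), "rb") as fh:
        previous = fh.read()
    with open(os.path.join(root, "dists", "Release"), "rb") as fh:
        current = fh.read()
```

After the rotation dists holds the new generation and dists.old the
previous one; unchanged files cost no extra space because they are
hardlinks.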


-- 
Julian Andres Klode  - Debian Developer, Ubuntu Member

See http://wiki.debian.org/JulianAndresKlode and http://jak-linux.org/.

