Multiarch file overlap summary and proposal (was: Summary: dpkg shared / reference counted files and version match)

There's been a lot of discussion of this, but it seems to have been fairly
inconclusive. We need to decide what we're doing, if anything, for wheezy
fairly soon, so I think we need to try to drive this discussion to some
sort of conclusion.

First, Steve's point here is very good:

Steve Langasek <firstname.lastname@example.org> writes:
> I guess we're looking at the same data, yet we seem to have reached
> opposite conclusions.
> - Riku reports that 33 out of 82k files have different compression when
> using current gzip vs. 10-year-old gzip. I'd be surprised if any of
> those binary packages hadn't been superseded long ago. It's not a
> guarantee, but I think the risks, and ultimate cost, of relying on gzip
> output to not change often and to just do sourceful rebuilds when it
> isn't are a lot smaller than if we go about manually splitting our
> packages further.
> - The cases where gzip output has been reported to not be reproducible
> seem to all boil down to a single issue with gzip being passed
> different arguments due to the unreproducible nature of *find*'s
> output. A patch has been made available already on the bug, and this
> patch seems to address the instances of the problem that we've hit so
> far in the Ubuntu archive.
> Now, it's worth following up with gzip upstream about our concerns, but
> even without that, I just don't see this being problematic.

It isn't the end of the world if we have some conflicts provided that we
can detect them and can do something consistent to fix them. I'm rather
nervous about relying on reproducibility of gzip because of Joey's
experience with pristine-tar, where he does find a lot of variation in
practice, but it is true that, for the purposes of multiarch, Debian *can*
possibly construct things such that we only need to worry about our own
gzip, which does simplify the situation.
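As a quick local sanity check, one can verify that a single gzip binary is at least deterministic for a given input. (Note this only tests same-binary determinism; the real concern above is stability across gzip versions, which this cannot show. The `-n` flag, which omits the original name and timestamp from the gzip header, is necessary for reproducible output in any case.)

```shell
# Compress the same input twice with the same flags and compare.
# -9 pins the compression level; -n drops the name/timestamp header fields.
printf 'the same payload every time\n' > /tmp/gzip-demo.txt
gzip -9nc /tmp/gzip-demo.txt > /tmp/gzip-demo.1.gz
gzip -9nc /tmp/gzip-demo.txt > /tmp/gzip-demo.2.gz
cmp -s /tmp/gzip-demo.1.gz /tmp/gzip-demo.2.gz && echo identical
```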

However, as we've subsequently discussed, those are not the only issues
with file overlaps between packages. So I'm going to try to summarize and
propose some possible solutions for the different issues. I'm going to
discuss these issues in order from the most consistent with a refcounting
solution to the least consistent.

1. Uncompressed files that we know are absolutely identical between
different architectures. These include arch-independent header files
that are just copied verbatim from the upstream source and data files
in textual formats or arch-independent binary formats that aren't
compressed and whose generation doesn't vary. (Symlinks are a special
case of this.) Reference counting works great for these. These also
resolve most of the file overlaps between -dev packages, and many of
the harder cases for interpackage dependencies if we split everything
out. I think it makes a lot of sense to use refcounting for these
files.
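As a toy sketch of the refcounting behaviour in question (my own illustration, not dpkg's actual implementation; the function names are invented): a shared file may be installed by several packages as long as the contents are byte-identical, and it is deleted only when the last referencing package goes away.

```shell
# install_shared <pkg> <src> <dest>: install a shared file, refusing
# co-installation unless the contents match, and record the owner.
install_shared() {
    if [ -e "$3" ]; then
        cmp -s "$2" "$3" || { echo "conflict: $3 differs" >&2; return 1; }
    else
        cp "$2" "$3"
    fi
    echo "$1" >> "$3.refs"      # one line per owning package
}

# remove_shared <pkg> <dest>: drop one owner; delete the file once no
# package references it any more.
remove_shared() {
    grep -Fxv "$1" "$2.refs" > "$2.refs.tmp" || true
    mv "$2.refs.tmp" "$2.refs"
    [ -s "$2.refs" ] || rm -f "$2" "$2.refs"
}
```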

2. Files like the above but that are compressed. This is most common in
the doc directory for things like README or the upstream changelog.
Upstream man pages written directly in *roff fall into this category as
well, for -dev packages. With Steve's point above about gzip, I think
we're probably okay using refcounting for this as well.

3. Generated documentation. Here's where I think refcounting starts
failing. Man pages generated from POD may change if the version of
Perl used to generate them changes, or if Pod::Simple or Pod::Man has
had a new release. Doxygen-generated HTML documentation is even more
likely to change. Many documentation generation systems will include
timestamps or other information that changes, or (even more likely)
will have minor changes in their output and formatting even if there is
nothing as obvious as a version number or timestamp.

I don't think we can use refcounting for generated documentation
produced as part of the package build process. If there is
Doxygen-generated documentation, generated man pages, or the like, I
think those have to be split into a separate arch: all package. Even
if it's just a couple of man pages. This is rather annoying, but I
think trying to use refcounting here is just too fragile.
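As an illustration of the failure mode, here is a stand-in for a documentation generator that stamps its output with the build time. (`gen_doc` is made up for this sketch; with real tools like pod2man or doxygen the culprits are embedded version strings and timestamps.)

```shell
# Two runs of a generator that embeds the build time produce different
# bytes, so the outputs of two binNMUs or rebuilds would conflict.
gen_doc() {
    printf 'Example manual page\nGenerated at: %s\n' "$(date +%s)" > "$1"
}
gen_doc /tmp/doc-a.txt
sleep 1                 # a later build on another buildd...
gen_doc /tmp/doc-b.txt
cmp -s /tmp/doc-a.txt /tmp/doc-b.txt || echo 'outputs differ'
```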

4. Lintian overrides. I believe these should be qualified with the
architecture on any multiarch: same package so that the overrides can
vary by architecture, since this is a semi-frequent use case for
overrides.

5. Data files that vary by architecture. This includes big-endian
vs. little-endian issues. These are simply incompatible with multiarch
as currently designed, and incompatible with the obvious variations
that I can think of, and will have to either be moved into
arch-qualified directories (with corresponding patches to the paths
from which the libraries load the data) or these packages can't be made
multiarch: same.

6. Debian changelogs. The actual content of these files changes with
binNMUs, so these obviously can't be refcounted at all right now. We
have to do something else here, probably by generating new
binary-specific changelog files for binNMUs.

Does this seem comprehensive to everyone? Am I missing any cases?

If this is comprehensive, then I propose the following path forward, which
is a mix of the various solutions that have been discussed:

* dpkg re-adds the refcounting implementation for multiarch, but along
with a Policy requirement that packages that are multiarch must only
contain files in classes 1 and 2 above.
* All packages that want to be multiarch: same have to move all generated
documentation into a separate package unless the maintainer has very
carefully checked that the generated documentation will be byte-for-byte
identical even across minor updates of the documentation generation
tools and when run at different times.
* Lintian should recognize arch-qualified override files, and multiarch:
same packages must arch-qualify their override files. debhelper
assistance is desired for this.
* Policy prohibits arch-varying data files in multiarch: same packages
except in arch-qualified paths.
* The binNMU process is changed to add the binNMU changelog entry to an
arch-qualified file (changelog.Debian.arch, probably). We need to
figure out what this means if the package being binNMU'd has a
/usr/share/doc/<package> symlink to another package, though; it's not
obvious what to do here.
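Under that proposal, the doc directory of a binNMU'd multiarch: same package might look something like this (the names here are purely illustrative; the exact naming is part of what would need to be decided):

```shell
pkg=libexample1           # hypothetical multiarch: same package
arch=amd64                # would come from dpkg-architecture at build time
docdir="/usr/share/doc/$pkg"
echo "$docdir/changelog.Debian.gz"        # shared, identical on all arches
echo "$docdir/changelog.Debian.$arch.gz"  # binNMU entry, per-architecture
```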

Please note that this is a bunch of work. I think the Lintian work is a
good idea regardless, and it can start independently. I think the same is
true of the binNMU changelog work, since this will address some
long-standing issues with changelog handling in some situations, including
resolving just how we're supposed to handle /usr/share/doc symlinks. But
even with those aside, this is a lot of stuff that we need to agree on,
and in some cases implement, in a fairly short timeline if this is going
to make wheezy.

Russ Allbery (email@example.com) <http://www.eyrie.org/~eagle/>