Multiarch file overlap summary and proposal (was: Summary: dpkg shared / reference counted files and version match)

To: debian-devel@lists.debian.org, debian-dpkg@lists.debian.org
Subject: Multiarch file overlap summary and proposal (was: Summary: dpkg shared / reference counted files and version match)
From: Russ Allbery <rra@debian.org>
Date: Mon, 13 Feb 2012 22:43:04 -0800
Message-id: <874nutncef.fsf_-_@windlord.stanford.edu>
In-reply-to: <20120211185237.GA10129@virgil.dodds.net> (Steve Langasek's message of "Sat, 11 Feb 2012 10:52:37 -0800")
References: <20120206073115.GB2033@rivendell.home.ouaza.com> <20120207095921.d5142d88cbb3dca679f33ec9@debian.org> <20120210225620.GA8782@gaara.hadrons.org> <20120211001446.GB2797@jwilk.net> <20120211005559.GA32671@burratino> <20120211011629.GB20155@virgil.dodds.net> <87zkcqrw2w.fsf@windlord.stanford.edu> <20120211185237.GA10129@virgil.dodds.net>

There's been a lot of discussion of this, but it seems to have been fairly
inconclusive.  We need to decide what we're doing, if anything, for wheezy
fairly soon, so I think we need to try to drive this discussion to some
concrete conclusions.

First, Steve's point here is very good:

Steve Langasek <vorlon@debian.org> writes:

> I guess we're looking at the same data, yet we seem to have reached
> opposite conclusions.

>  - Riku reports that 33 out of 82k files have different compression when
>    using current gzip vs. 10-year-old gzip.  I'd be surprised if any of
>    those binary packages hadn't been superseded long ago.  It's not a
>    guarantee, but I think the risks, and ultimate cost, of relying on gzip
>    output to not change often and to just do sourceful rebuilds when it
>    isn't are a lot smaller than if we go about manually splitting our
>    packages further.

>  - The cases where gzip output has been reported to not be reproducible
>    seem to all boil down to a single issue with gzip being passed
>    different arguments due to the unreproducible nature of *find*'s
>    output.  A patch has been made available already on the bug, and this
>    patch seems to address the instances of the problem that we've hit so
>    far in the Ubuntu archive.

> Now, it's worth following up with gzip upstream about our concerns, but
> even without that, I just don't see this being problematic.

It isn't the end of the world if we have some conflicts provided that we
can detect them and can do something consistent to fix them.  I'm rather
nervous about relying on reproducibility of gzip because of Joey's
experience with pristine-tar, where he does find a lot of variation in
practice, but it is true that, for the purposes of multiarch, Debian *can*
possibly construct things such that we only need to worry about our own
gzip, which does simplify the situation.
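
As a rough illustration of that narrower claim (one gzip implementation,
identical input, no timestamp in the header), here is a minimal Python
sketch.  It uses Python's own gzip module rather than the gzip binary, so
it only approximates "gzip -9n"; the helper name and the sample data are
made up for the example, and it says nothing about stability across
different gzip versions.

```python
import gzip


def compress_reproducibly(data: bytes) -> bytes:
    """Compress at level 9 with the header timestamp fixed to zero.

    Hypothetical helper: mtime=0 keeps the build time out of the gzip
    header, which is roughly what "gzip -n" does for the timestamp.
    """
    return gzip.compress(data, compresslevel=9, mtime=0)


# Made-up stand-in for a README shipped under /usr/share/doc.
readme = b"Example README shipped in the doc directory\n" * 20

first = compress_reproducibly(readme)
second = compress_reproducibly(readme)

# Same input but different header timestamps: the payload is the same,
# yet the compressed files are no longer byte-for-byte identical.
stamped_a = gzip.compress(readme, compresslevel=9, mtime=1)
stamped_b = gzip.compress(readme, compresslevel=9, mtime=2)
```

Comparing first with second, and stamped_a with stamped_b, shows why the
timestamp has to be suppressed before the compressed files can be shared.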

However, as we've subsequently discussed, those are not the only issues
with file overlaps between packages.  So I'm going to try to summarize and
propose some possible solutions for the different issues.  I'm going to
discuss these issues in order from the most consistent with a refcounting
solution to the least consistent.

1. Uncompressed files that we know are absolutely identical between
   different architectures.  These include arch-independent header files
   that are just copied verbatim from the upstream source and data files
   in textual formats or arch-independent binary formats that aren't
   compressed and whose generation doesn't vary.  (Symlinks are a special
   case of this.)  Reference counting works great for these.  These also
   resolve most of the file overlaps between -dev packages, and many of
   the harder cases for interpackage dependencies if we split everything
   out.  I think it makes a lot of sense to use refcounting for these
   files.

2. Files like the above but that are compressed.  This is most common in
   the doc directory for things like README or the upstream changelog.
   Upstream man pages written directly in *roff fall into this category as
   well, for -dev packages.  With Steve's point above about gzip, I think
   we're probably okay using refcounting for this as well.

3. Generated documentation.  Here's where I think refcounting starts
   failing.  Man pages generated from POD may change if the version of
   Perl used to generate them changes, or if Pod::Simple or Pod::Man has
   had a new release.  Doxygen-generated HTML documentation is even more
   likely to change.  Many documentation generation systems will include
   timestamps or other information that changes, or (even more likely)
   will have minor changes in their output and formatting even if there is
   nothing as obvious as a version number or timestamp.

   I don't think we can use refcounting for generated documentation
   produced as part of the package build process.  If there is
   Doxygen-generated documentation, generated man pages, or the like, I
   think those have to be split into a separate arch: all package.  Even
   if it's just a couple of man pages.  This is rather annoying, but I
   think trying to use refcounting here is just too fragile.

4. Lintian overrides.  I believe these should be qualified with the
   architecture on any multiarch: same package so that the overrides can
   vary by architecture, since this is a semi-frequent use case for
   Lintian.

5. Data files that vary by architecture.  This includes big-endian
   vs. little-endian issues.  These are simply incompatible with multiarch
   as currently designed, and with the obvious variations that I can think
   of.  Either they are moved into arch-qualified directories (with
   corresponding patches to the paths from which the libraries load the
   data), or these packages can't be made multiarch.

6. Debian changelogs.  The actual content of these files changes with
   binNMUs, so these obviously can't be refcounted at all right now.  We
   have to do something else here, probably by generating new
   binary-specific changelog files for binNMUs.
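
To make the refcounting idea behind these classes concrete, here is a toy
Python model.  It is only a sketch under my own assumptions: the class
name, the method names, and the use of a SHA-256 digest to compare
content are all invented for illustration and are not how dpkg stores or
checks anything.

```python
import hashlib


class Conflict(Exception):
    """Raised when a path cannot be shared."""


class SharedFileDB:
    """Toy model of refcounted files; not dpkg's real data structures."""

    def __init__(self):
        # path -> (owning package, content digest, set of architectures)
        self._files = {}

    def install(self, package, arch, path, content):
        digest = hashlib.sha256(content).hexdigest()
        entry = self._files.get(path)
        if entry is None:
            self._files[path] = (package, digest, {arch})
            return
        owner, known_digest, archs = entry
        if owner != package:
            raise Conflict(f"{path} is owned by {owner}")
        if known_digest != digest:
            # Classes 3 to 6 above end up here sooner or later.
            raise Conflict(f"{path} differs between architectures")
        archs.add(arch)

    def remove(self, package, arch, path):
        owner, _digest, archs = self._files[path]
        if owner != package:
            raise Conflict(f"{path} is owned by {owner}")
        archs.discard(arch)
        if not archs:
            # Last reference gone: the file itself would be deleted.
            del self._files[path]

    def refcount(self, path):
        entry = self._files.get(path)
        return len(entry[2]) if entry else 0
```

Installing the same header from the amd64 and i386 builds of one package
raises the count to two, while a binNMU'd changelog with different content
raises Conflict on the second install.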

Does this seem comprehensive to everyone?  Am I missing any cases?

If this is comprehensive, then I propose the following path forward, which
is a mix of the various solutions that have been discussed:

* dpkg re-adds the refcounting implementation for multiarch, but along
  with a Policy requirement that the files a multiarch: same package
  shares across architectures must all be in classes 1 and 2 above.

* All packages that want to be multiarch: same have to move all generated
  documentation into a separate package unless the maintainer has very
  carefully checked that the generated documentation will be byte-for-byte
  identical even across minor updates of the documentation generation
  tools and when run at different times.

* Lintian should recognize arch-qualified override files, and multiarch:
  same packages must arch-qualify their override files.  debhelper
  assistance is desired for this.

* Policy prohibits arch-varying data files in multiarch: same packages
  except in arch-qualified paths.

* The binNMU process is changed to add the binNMU changelog entry to an
  arch-qualified file (changelog.Debian.arch, probably).  We need to
  figure out what this means if the package being binNMU'd has a
  /usr/share/doc/<package> symlink to another package, though; it's not
  obvious what to do here.
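
For the binNMU changelog item, the file layout could be sketched roughly
as follows in Python.  The function names are hypothetical, the ".gz"
suffix on the arch-qualified file is my assumption (the name itself is
only "probably" changelog.Debian.arch), and the /usr/share/doc symlink
case is deliberately not handled.

```python
import re

# binNMU versions carry a "+b<N>" suffix, for example "1.2-3+b1".
BINNMU_SUFFIX = re.compile(r"\+b\d+$")


def is_binnmu(version: str) -> bool:
    """Return True if the version string looks like a binNMU version."""
    return BINNMU_SUFFIX.search(version) is not None


def changelog_paths(package: str, version: str, arch: str) -> list[str]:
    """List the Debian changelog files a binary package would ship.

    The shared changelog stays identical on every architecture; only a
    binNMU adds the arch-qualified file with the extra entry.
    """
    doc_dir = f"/usr/share/doc/{package}"
    paths = [f"{doc_dir}/changelog.Debian.gz"]
    if is_binnmu(version):
        # Assumed naming: changelog.Debian.<arch>, compressed like the rest.
        paths.append(f"{doc_dir}/changelog.Debian.{arch}.gz")
    return paths
```

With this layout the shared changelog.Debian.gz remains a class 2 file,
and the part that varies lives at a path that never overlaps between
architectures.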

Please note that this is a bunch of work.  I think the Lintian work is a
good idea regardless, and it can start independently.  I think the same is
true of the binNMU changelog work, since this will address some
long-standing issues with changelog handling in some situations, including
resolving just how we're supposed to handle /usr/share/doc symlinks.  But
even with those aside, this is a lot of stuff that we need to agree on,
and in some cases implement, in a fairly short timeline if this is going
to make wheezy.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
