
Summary: dpkg shared / reference counted files and version match



[ Obviously this “summary” could be considered biased, but I do think
  the facts presented are accurate. ]

Hi,

The two reasons for the shared / reference counted files (refcnt from
now on) implementation in dpkg have been:

* To avoid massive package proliferation (due to the mandated copyright
  and changelog files), thus the work involved in a one-time split and
  the size increase in the Packages indices.

* To avoid unneeded file duplication, thus wasted space (due to those
  mandated files, but also partially just as a consequence of not
  splitting files into new arch:all packages, per above).


This has the following implications:

* Deploying refcnt means that M-A:same packages must always be at the
  exact same installed version, so that the file contents can match.
  ↓
  More difficult upgrade paths, as this ties the different arch
  dependency trees around M-A:same barriers.

* binNMUs need to be performed in lockstep for *all* architectures,
  because the installed versions need to match.
  ↓
  Causing useless buildd usage and user downloads for arches not
  affected. “Fixing” this by making dpkg treat binNMU versions specially,
  besides being just another special case needed for M-A:same packages,
  would be wrong, as arch-indep content can actually change between
  builds, e.g. generated documentation.

* binNMUs for the same version might not be co-installable because doc
  generators, compressors, etc, might not always produce the same output.
  ↓
  This is a pretty fragile thing to rely on. New architectures or local
  builds might have a hard time if the generated output has changed at
  some point in the past. A possible fix, but only for the compressed
  files case, might be to ship them uncompressed, but that counters the
  desire to reduce wasted space.

* binNMUs for the same version cannot be co-installed anyway as their
  changelogs differ.
  ↓
  That could be “fixed” by using the same email address and a hardcoded
  date, or not including the binNMU entry at all, or moving that entry
  to a new field, etc. All of which seem like ugly hacks, or a possible
  loss of information.

* It means special-casing M-A:same on identical file conflicts.
  ↓
  The same thing could be argued should be made possible for packages
  generated from the same source, a “problem” we've always had and
  managed just fine up to now with changes at the packaging level.

* Once implemented, this “feature” cannot be taken out, *ever*.
  ↓
  Because taking it out will produce installation errors; even a long
  transition would not help, as it would not guarantee that external or
  old packages are fine.


Conclusion
----------

The above means that binNMUs are currently unusable for any source
package building an M-A:same package, making the release team's job
harder, or requiring sourceful uploads by maintainers instead.

Given the numbers seen on this thread, the estimated amount of new
packages required to be split off is actually pretty low (less than 2%
of the current total), and new arch:all packages should be considered
cheap as long as the payload weighs more than the metadata and the
binary format itself; they should generally reduce archive and
multiarch DVD space usage. For the Packages indices there's pdiffs,
which (although not currently optimally implemented) should only get
downloaded on specific package updates, and Descriptions are only
downloaded once nowadays. And these are nothing compared to the amount
of new packages pulled in for each foreign arch configured.

It's been mentioned that splitting packages is a daft idea because
it causes more burden to library package maintainers while dpkg could
do the job once instead, but this is a progressive, one-time thing,
while the above implications are *forever*; and if maintainers are
required to do sourceful uploads instead of getting binNMUs done, it
actually means even more of a burden for them.

Even if no packages were to get split off and all arch-indep file
paths were arch-qualified (which would actually be wrong in some cases,
as some of those arch-indep files should not get an arch-qualified
path), and the overhead of the duplicated files was considered an issue
(although the actual libraries will usually use way more space than
those few duped files), there's always --path-exclude for the ones not
affecting functionality.
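For instance, duplicated doc files could be skipped at unpack time with
a dpkg configuration fragment along these lines (a sketch: the file name
and glob patterns are illustrative, and path-include is used so the
mandated files are kept):

```
# /etc/dpkg/dpkg.cfg.d/no-duped-docs  (hypothetical file name)
path-exclude=/usr/share/doc/*
# but keep the mandated copyright files and changelogs:
path-include=/usr/share/doc/*/copyright
path-include=/usr/share/doc/*/changelog*
```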

It does not seem to make sense to consider the “huge” space usage due
to not refcnt'ing an issue, when for that to happen and be significant
one would need to install hundreds of M-A:same packages for multiple
architectures, taking hundreds of MiB (if not GiB), at which point I'm
not sure how one could make a fuss over a few hundred wasted MiB, if
at all.

For the unreliable generated output problem: even if gzip is considered
frozen and in maintenance mode now, that does not mean this could not
change in the future. It also means we could not safely switch
compressors later, as the design would have cornered us.

Switching to uncompressed files to work around the unreliable generated
output problem still only papers over one part of the issue, and defeats
the size savings in common situations (single-arch installs).

                             ----

So it really does not seem worth it: it does way more harm than good,
it will generate more overall waste, it makes transitions and upgrades
more difficult, it makes M-A:same packages even more asymmetric and
exceptional than they need to be, the size reduction arguments do not
really seem to hold much, and overall it seems to be the actually more
complex solution to the problem.

In addition, concerning the mandatory files (copyright and changelog):
if we eventually go forward with my proposal to make them actual
package metadata, then dpkg can manage them in its db in any way we see
fit, including automatically compressing or refcnt'ing them when they
actually match, thus reducing installed size usage.

Given all the above, I'll be pulling out the file refcnt and version
match logic from my pu/multiarch/master branch for now. If some
compelling arguments are brought up, something I honestly don't really
see happening, they can be reintroduced at any point.


Proposed solution
-----------------

M-A:same packages cannot have any conflicting files with their foreign
counterparts. Thus:

For files in M-A:same packages under a pkgname-based path, the pkgname
should always be arch-qualified with the Debian architecture. Most of
these could be handled automatically by debhelper and cdbs; this
includes things like:

  /usr/share/doc/pkgname/
  /usr/share/bug/pkgname
  /usr/share/lintian/overrides/pkgname
  /usr/share/mime-info/pkgname.*
  /usr/share/menu/pkgname
  ...

(Joey, I'm guessing you might consider it too late to do some of these in
debhelper for compat level 9, right?)

For toolchain-related files in M-A:same packages, their path should be
arch-qualified using the multiarch triplet; this includes
arch-dependent headers and similar.
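To make the two qualification schemes concrete, here is a small sketch.
The values are hard-coded examples standing in for what
dpkg-architecture -qDEB_HOST_ARCH and -qDEB_HOST_MULTIARCH would report
at build time, and the exact pkgname:arch spelling is only illustrative:

```shell
# Example values; a real build would query dpkg-architecture instead.
arch=amd64                  # Debian architecture (DEB_HOST_ARCH)
triplet=x86_64-linux-gnu    # multiarch triplet (DEB_HOST_MULTIARCH)
pkg=libfoo1

# pkgname-based paths get qualified with the Debian architecture:
docdir="/usr/share/doc/${pkg}:${arch}"

# toolchain-related paths get qualified with the multiarch triplet:
incdir="/usr/include/${triplet}/foo"

echo "$docdir"    # → /usr/share/doc/libfoo1:amd64
echo "$incdir"    # → /usr/include/x86_64-linux-gnu/foo
```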

The remaining files that are truly arch-independent, like headers, man
pages, development docs, etc, should be split off into arch:all
package(s), along these lines:

  libfooN-doc
  libfooN-headers
  libfooN-common
  libfooN-common-dev
  libfooN-data
  girX.Y-foo
  ...
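As a sketch, such a split for a hypothetical libfoo could look like this
in debian/control (package names and descriptions are illustrative):

```
Package: libfoo1
Architecture: any
Multi-Arch: same
Description: shared library (arch-dependent, co-installable)

Package: libfoo1-doc
Architecture: all
Multi-Arch: foreign
Description: development documentation (arch-independent, installed once)
```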

Anything else remaining should be considered a bug.


regards,
guillem

