
Decreasing packaging overhead



Thomas Goirand wrote:
> But good luck to teach good practices upstream. See Ross's reply: 120
> packages are depending on this.

It's more than that.  Given tooling that doesn't have excessive overhead
for small packages, why call such packages "bad practices" in the first
place?

> Though it is also my view that packaging tiny stuff shouldn't be a
> problem. If it is, then we should fix whatever it is that is problematic
> in Debian infra.

Agreed.

Let's consider what overhead exists for a Debian package, and what we
could potentially reduce or remove, using node-defined as an example.
(Obviously, any such metadata changes may need a full Debian release
to propagate to tools like apt and dpkg.)  To make
redundancy more evident, I'll include everything first before discussing
any of it.

First, an entry in Sources that looks like this, for each Debian suite
(unstable/testing/stable/oldstable):

Package: node-defined
Binary: node-defined
Version: 1.0.0-1
Maintainer: Debian Javascript Maintainers <pkg-javascript-devel@lists.alioth.debian.org>
Uploaders: Ross Gammon <rossgammon@mail.dk>
Build-Depends: debhelper (>= 9), dh-buildinfo, nodejs
Architecture: all
Standards-Version: 3.9.6
Format: 3.0 (quilt)
Files:
 43ab019e6b53b9f4d4ff338027cb351d 1997 node-defined_1.0.0-1.dsc
 978d30ee28482aa7812f74f812b1899f 2334 node-defined_1.0.0.orig.tar.gz
 557f4bcec8a449608e50d09ba69bd224 2416 node-defined_1.0.0-1.debian.tar.xz
Vcs-Browser: https://anonscm.debian.org/cgit/pkg-javascript/node-defined.git
Vcs-Git: git://anonscm.debian.org/pkg-javascript/node-defined.git
Checksums-Sha1:
 02cb2027e3218b93fd856a5e3b68134fe01e47c1 1997 node-defined_1.0.0-1.dsc
 eff888bf76f9cfcca2b94e39c470a6c1441b3f03 2334 node-defined_1.0.0.orig.tar.gz
 7237a9a8aee2add44a9d8bb0dae382c3f0a923cf 2416 node-defined_1.0.0-1.debian.tar.xz
Checksums-Sha256:
 4aa2a079bc7119678c58643def268e4789b56a6a40b2931601de527244a1def8 1997 node-defined_1.0.0-1.dsc
 d953e6e9fe9277cc6e68e5bb36a299d8f3505f8facd3468ab7edc7d6858d293a 2334 node-defined_1.0.0.orig.tar.gz
 56ede623ee7929fcb334fa7459c3e3f43b529bf2b585866d5ebc9ee06cc3d03d 2416 node-defined_1.0.0-1.debian.tar.xz
Homepage: https://github.com/substack/defined
Package-List: 
 node-defined deb web optional arch=all
Testsuite: autopkgtest
Directory: pool/main/n/node-defined
Priority: extra
Section: misc

Second, an entry in *each architecture's* Packages file like this, for each
Debian suite:

Package: node-defined
Version: 1.0.0-1
Installed-Size: 19
Maintainer: Debian Javascript Maintainers <pkg-javascript-devel@lists.alioth.debian.org>
Architecture: all
Depends: nodejs
Description: return the first argument that is `!== undefined`
Homepage: https://github.com/substack/defined
Description-md5: b4200f8f2e989c1354c3c1cb3677e663
Section: web
Priority: optional
Filename: pool/main/n/node-defined/node-defined_1.0.0-1_all.deb
Size: 3292
MD5sum: d5a08f2219b4128a49be206caeb5b8b4
SHA1: 115317d45d5028203269d84aa07c447d7c12ea7b
SHA256: 5be875d209afc69aa2d6be10bbed3c514e75f0a5e8d5a769a6461f42ab6db581

(Note that a source package with multiple binary packages would have multiple
such entries.)

Third, an entry in Translation-en (and every other translation), for each
Debian suite:

Package: node-defined
Description-md5: b4200f8f2e989c1354c3c1cb3677e663
Description-en: return the first argument that is `!== undefined`
 Most of the time when you chain together ||s, you actually just want the
 first item that is not undefined, not the first non-falsy item.
 .
 This module is like the defined-or (//) operator in perl 5.10+.
 .
 Node.js is an event-based server-side JavaScript engine.

Fourth, the source package .dsc file:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 3.0 (quilt)
Source: node-defined
Binary: node-defined
Architecture: all
Version: 1.0.0-1
Maintainer: Debian Javascript Maintainers <pkg-javascript-devel@lists.alioth.debian.org>
Uploaders: Ross Gammon <rossgammon@mail.dk>
Homepage: https://github.com/substack/defined
Standards-Version: 3.9.6
Vcs-Browser: https://anonscm.debian.org/cgit/pkg-javascript/node-defined.git
Vcs-Git: git://anonscm.debian.org/pkg-javascript/node-defined.git
Testsuite: autopkgtest
Build-Depends: debhelper (>= 9), dh-buildinfo, nodejs
Package-List:
 node-defined deb web optional arch=all
Checksums-Sha1:
 eff888bf76f9cfcca2b94e39c470a6c1441b3f03 2334 node-defined_1.0.0.orig.tar.gz
 7237a9a8aee2add44a9d8bb0dae382c3f0a923cf 2416 node-defined_1.0.0-1.debian.tar.xz
Checksums-Sha256:
 d953e6e9fe9277cc6e68e5bb36a299d8f3505f8facd3468ab7edc7d6858d293a 2334 node-defined_1.0.0.orig.tar.gz
 56ede623ee7929fcb334fa7459c3e3f43b529bf2b585866d5ebc9ee06cc3d03d 2416 node-defined_1.0.0-1.debian.tar.xz
Files:
 978d30ee28482aa7812f74f812b1899f 2334 node-defined_1.0.0.orig.tar.gz
 557f4bcec8a449608e50d09ba69bd224 2416 node-defined_1.0.0-1.debian.tar.xz

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBCAAGBQJWKj8IAAoJEPNPCXROn13ZrhwP/1+FQtC5NIM1SAWj8capx3Sm
rdLtO29o+M7mSaiN7c10IYn+OXFu+AMFikVnD4+6Jzj3qtWfk6sgRWBsU2IXQ9Br
xUj8pskB5t2Ti8aAzoId3wKxgOL9JF9u6b7MzkER1WOXOOMjmT16OASRjx1vmJSh
OrDHJKJN2n8KIoJerWQ3d9GazCFgQZ3HfgDXWUeupkWG8emGoyvScpscsab1Mdq9
BvA5X5k4XCGalIeEXAbrx4wR6dHLfldEY/K0g3RyLmicPZbcHeMeaEBOSRkIsr7W
yjzcdz7T2TAdbxG1ZOzumDcpEUKEZDSKFZwysaccyGpPsts9ZGYU5HeM5MdwJzKc
6fobHtC+ARzgcp8Fxq1xitO/zfQnJA5eUbWMykjLsf8LeZI0/g1VFVnxv2cfSHNP
dh/OrGNtPAWaJgwsb/LwR2d+WinAYMocTO6n9D3ONyV6OrvVi81fRWcp24Mo4rDH
0oDG4vaZyeyKyDenHJzFCm2AlZ7pnosFx96aIHOmEeMwE0/xMedjzaE1sbWd4/Ma
rf6xOVI+Tqj+YYMLLC6+dP6gNzx3qTTBBOVijotllxNGzjKUgOR0jP0RSsNyXAW/
QPYz5aftp0icn5nEEeXfjfrcclOtrAAGH7wiMKiNT99YgI/zJwHetBuESVBJ3OOT
XZmmN/c/EAnp8AWFYJuy
=U3sK
-----END PGP SIGNATURE-----

We can skip the .orig.tar.gz; that's the package itself, not overhead.

Fifth, the contents of the .debian.tar.xz:

debian/
debian/tests/
debian/tests/require
debian/tests/control
debian/docs
debian/upstream/
debian/upstream/metadata
debian/watch
debian/copyright
debian/examples
debian/changelog
debian/control
debian/compat
debian/rules
debian/install
debian/source/
debian/source/format
debian/gbp.conf

Of those, the files with the most significant overhead or duplication
include debian/control, debian/changelog, debian/copyright,
debian/tests/control (could be reduced or eliminated via conventions),
debian/gbp.conf, and debian/upstream/metadata.  (Some of the rest could
be reduced or eliminated via conventions as well, though.)

And sixth, the files in the .deb:

drwxr-xr-x root/root         0 2015-10-23 06:59 ./
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/share/
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/share/doc/
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/share/doc/node-defined/
-rw-r--r-- root/root       158 2015-10-21 07:27 ./usr/share/doc/node-defined/changelog.Debian.gz
-rw-r--r-- root/root      1442 2015-10-21 07:27 ./usr/share/doc/node-defined/copyright
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/share/doc/node-defined/examples/
-rw-r--r-- root/root       123 2015-03-30 15:47 ./usr/share/doc/node-defined/examples/defined.js
-rw-r--r-- root/root      1082 2015-03-30 15:47 ./usr/share/doc/node-defined/readme.markdown
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/lib/
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/lib/nodejs/
drwxr-xr-x root/root         0 2015-10-23 06:59 ./usr/lib/nodejs/defined/
-rw-r--r-- root/root      1094 2015-03-30 15:47 ./usr/lib/nodejs/defined/package.json
-rw-r--r-- root/root       150 2015-03-30 15:47 ./usr/lib/nodejs/defined/index.js

The files in /usr/lib/nodejs are the contents of the package; they don't count.
Examples and the upstream readme are at least arguably useful.  However,
copyright and changelog.Debian.gz are Debian overhead.

In this and all following discussions, given the use of compression, we
can mostly assume that field names in aggregated control files take
almost no space; their existence at all does add a tiny amount of
overhead, but we wouldn't save any space by reducing the lengths of
field names, only by eliminating fields entirely or reducing their
unique content.
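
As a rough illustration of that claim, here is a minimal Python sketch
(not taken from any Debian tool) that builds two synthetic control files
differing only in the length of their field names, then compares sizes
before and after zlib compression.  The stanza contents, the abbreviated
names "P" and "C", and the count of 2000 entries are all made up for the
demonstration; real Packages and Sources files will behave somewhat
differently.

```python
import hashlib
import zlib


def synthetic_index(package_field, checksum_field, count=2000):
    """Build a fake aggregated control file with `count` two-field stanzas."""
    stanzas = []
    for i in range(count):
        # Deterministic hex digest standing in for per-package unique content.
        digest = hashlib.sha256(str(i).encode("ascii")).hexdigest()
        stanzas.append(f"{package_field}: pkg{i}\n{checksum_field}: {digest}\n")
    return "\n".join(stanzas).encode("ascii")


long_names = synthetic_index("Package", "Checksums-Sha256")
short_names = synthetic_index("P", "C")  # hypothetical abbreviated field names

# How many bytes the shorter names save, before and after compression.
raw_saving = len(long_names) - len(short_names)
compressed_saving = (len(zlib.compress(long_names, 9))
                     - len(zlib.compress(short_names, 9)))

print("bytes saved uncompressed:", raw_saving)
print("bytes saved compressed:  ", compressed_saving)
```

The uncompressed saving grows with every stanza, while the compressed
saving should stay comparatively small, because each repeated field name
compresses to a short back-reference whatever its length.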

However, as far as I can tell, we could reduce the per-package overhead
and redundancy quite a bit.  A few examples:

"Binary" seems a bit excessive for several reasons.  First, it seems
redundant with the "Source" entries in Packages files; we don't
necessarily need a two-way cross-reference at all here.  And second, we
could assume that a missing entry means "same as Package".  That rule
(source equals binary) would work for 13364 of 24097 packages (about
55%) in Debian today, and potentially more if other single-binary
packages ensured their source and binary names matched.
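
A minimal sketch of that defaulting rule, in Python.  This is only an
illustration of the idea, not how apt or dak work today; the helper
names parse_stanza and binaries_of are invented here, and the parser
handles just enough of the control-file syntax for the example.

```python
def parse_stanza(text):
    """Parse one control-file stanza into a dict, joining continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and key is not None:
            # Continuation line: append to the previous field's value.
            fields[key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


def binaries_of(source_entry):
    """Return binary package names; a missing Binary means "same as Package"."""
    binary = source_entry.get("Binary")
    if binary is None:
        return [source_entry["Package"]]
    names = binary.replace("\n", " ").split(",")
    return [name.strip() for name in names if name.strip()]


# A Sources entry with Binary omitted, as proposed above.
slim = parse_stanza("Package: node-defined\nVersion: 1.0.0-1\n")
print(binaries_of(slim))

# A made-up multi-binary source still lists its binaries explicitly.
multi = parse_stanza("Package: foo\nBinary: foo, libfoo1,\n libfoo-dev\n")
print(binaries_of(multi))
```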

For that matter, Binary and Package-List seem redundant.  (And
Package-List doesn't seem like end-user metadata; it seems like
something only the Debian infrastructure needs.)

Many fields, such as Maintainer and Uploaders, seem unnecessary to
extract into aggregated files; their presence in *one* place should
suffice.  Developer tools just need these in debian/control in the
source package.  The Debian infrastructure does need them, but end-users
mostly don't, and they especially don't need to download them as part of
the aggregated package metadata.

Do we really need fields like Build-Depends, Testsuite, or
Standards-Version pulled out of the package itself and placed into the
Sources file?  Why do we need to read those without the source package?
(Note that tools that form part of Debian infrastructure could work from
UDD or similar; the question is why those fields are needed on an
end-user system that downloaded the Sources file.)

Files, Checksums-Sha1, and Checksums-Sha256 are clearly redundant; has
it been long enough that we can drop the first two yet?

Now that we use a secure hash, do we really need the sizes in those
fields?  Furthermore, we could generate the filenames from the source
name and version.  And finally, the entries for every file other than
the dsc itself duplicate checksums already listed inside the dsc.  So we
could really reduce this down to a single secure checksum of the dsc.
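
To make that concrete, here is a small Python sketch of what a client
could do with only the source name, the version, and one SHA-256 value.
The function names are invented for illustration.  The sketch covers
only the dsc; the other files would then be checked against the
checksums inside the verified dsc.

```python
import hashlib


def dsc_filename(source, version):
    """Derive the .dsc file name from the source name and version."""
    # An epoch ("N:" prefix) is part of the version but not of the file name.
    epoch, sep, rest = version.partition(":")
    bare_version = rest if sep else version
    return f"{source}_{bare_version}.dsc"


def verify_dsc(data, expected_sha256):
    """Check downloaded dsc bytes against the single recorded checksum."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


print(dsc_filename("node-defined", "1.0.0-1"))
```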

Homepage really doesn't need to live in both files.

Format doesn't need pulling out; tools could just parse that from the
dsc file.

Directory seems entirely derivable; if we want to support a variety of
repository layouts, we could put repository layout information into the
Release file.
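
For the layout the Debian archive uses today, the derivation is short
enough to sketch in a few lines of Python.  This assumes the usual pool
convention (first letter of the source name, or the first four
characters for names starting with "lib"); the function name and the
default component are just for illustration, and a repository with a
different layout would need its own rule.

```python
def pool_directory(source, component="main"):
    """Derive the pool directory for a source package in the standard layout."""
    # "lib*" sources are split further to keep the "l" directory manageable.
    prefix = source[:4] if source.startswith("lib") else source[0]
    return f"pool/{component}/{prefix}/{source}"


print(pool_directory("node-defined"))
```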

Priority and Section seem not only redundant between source and binary,
but actively wrong: note that they differ in this case, yet the source
builds only one binary.  extra/misc seems wrong; optional/web seems
correct (at least until we establish a "js" section).

In the Packages files for binaries, we could eliminate a *massive*
amount of redundancy by having a dedicated Packages file for "all", to
avoid duplicating entries into every architecture's Packages file.  That
should not significantly increase overhead for end-users, and for any
user of multiarch it'll decrease overhead.  A quick check on amd64 shows
that splitting out "all" into a separate Packages file would not change
the combined uncompressed size at all, should not change the pdiff size
at all, and would increase the combined compressed full-download size by
94k, from 9957k to 10051k, an increase of less than 1%.  That seems
reasonable in exchange for eliminating 12 duplicate copies of the 4396k
used for "all" entries in the Packages files of every suite
(oldstable/stable/testing/unstable/experimental), and that doesn't even
count unofficial architectures or snapshot.debian.org.
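
Spelling out that arithmetic in Python, using only the figures quoted
above.  The total at the end assumes, purely for illustration, that
every suite carries the same 4396k of "all" entries as the one measured;
the real number differs per suite.

```python
# Figures quoted above, in kilobytes.
combined_before = 9957    # compressed full download today
combined_after = 10051    # compressed full download with "all" split out
all_entries_size = 4396   # "all" entries carried in each Packages file
duplicate_copies = 12     # redundant copies eliminated per suite
suites = ["oldstable", "stable", "testing", "unstable", "experimental"]

increase = combined_after - combined_before
increase_percent = 100 * increase / combined_before

saved_per_suite = duplicate_copies * all_entries_size
# Assumption: the same size applies in every suite.
saved_total = saved_per_suite * len(suites)

print(f"end-user increase: {increase}k ({increase_percent:.2f}%)")
print(f"archive saving per suite: {saved_per_suite}k")
print(f"archive saving across {len(suites)} suites: {saved_total}k")
```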

Ditto for translated descriptions, except that there, we should share
descriptions across architectures by default, even for arch-specific
packages.  Almost no packages have descriptions that vary by
architecture.

For Packages, we have a similar waste of space storing md5 and sha1
hashes, and .deb package size.  The same applies inside dsc files, which
also spell out filenames that are mostly derivable.

For translated descriptions, Package and Description-md5 seem redundant.

In the dsc, we have a similar redundancy between Source, Binary, and
Package-List.  And even if the section/priority/etc information made
sense in Package-List, it gets overridden with the canonical information
provided by the archive.

That's not even getting into the more controversial items, like
debian/changelog (a vestige of a pre-VCS era), or debian/copyright.  Or
more fundamental changes, like stuffing absolutely everything into a
single git repository for deduplication and incremental downloads.

- Josh Triplett

