DEP 17: Improve support for directory aliasing in dpkg
Hi,
I have been looking into the aliasing problems in dpkg on behalf of
Freexian's Debian funding. To that end I proposed a possible way forward
last year (https://lists.debian.org/debian-dpkg/2022/11/msg00007.html),
but the feedback I got was not particularly helpful in determining
consensus. A little later, Simon Richter also looked into the problem
(https://lists.debian.org/debian-dpkg/2022/12/msg00023.html), but
remained silent after the initial post. Little has happened since then. Now
Raphael Hertzog has proposed using the DEP process to get this thing
unstuck, and with the help of Emilio Pozuelo Monfort I created a draft
for discussion. I allocated number 17 via debian-project@l.d.o. What
follows is the draft text. Please consider it a best-effort attempt at
reconciling feedback wherever I could. At the time of this
writing it certainly is not consensus, but consensus is what I seek
here. Without further ado, the full DEP text follows after my name;
it is also available at
https://salsa.debian.org/dep-team/deps/-/merge_requests/5
Helmut
Introduction
============
At its core, `dpkg` assumes that every filename uniquely refers to a
file on disk. The situation where two distinct filenames refer to the
same file on disk is referred to as aliasing. Violating this assumption
leads to undefined behaviour such as file loss. The assumption is
commonly violated when a leading directory component contains a symbolic
link. A situation where this is known to cause file loss happens when a
file is moved from one binary package to another binary package while at
the same time changing the filename in a way that retains its final
location. In this situation, `dpkg` may first unpack the file from the
replacing package and then remove the replaced package, thus unknowingly
removing the aliased file. Other components such as `dpkg-divert` or
`update-alternatives` are likely affected in similar ways.
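
To make the failure mode concrete, here is a toy reproduction in Python. It is not `dpkg` code: the string comparison is a deliberately simplified stand-in for how a package manager that is unaware of aliasing might decide which files of the replaced package to remove, and the names `simulate_file_move` and `tool` are made up for illustration.

```python
import os
import tempfile


def simulate_file_move() -> bool:
    """Return True if the moved file survives the upgrade, False if it is lost."""
    with tempfile.TemporaryDirectory() as root:
        # An installation where /bin is a symbolic link to usr/bin.
        os.makedirs(os.path.join(root, "usr/bin"))
        os.symlink("usr/bin", os.path.join(root, "bin"))

        # The replaced package ships the file as /bin/tool.
        with open(os.path.join(root, "bin/tool"), "w") as f:
            f.write("old version\n")

        # The replacing package ships the same file as /usr/bin/tool and is
        # unpacked first. On disk this overwrites the very same file.
        new_path = os.path.join(root, "usr/bin/tool")
        with open(new_path, "w") as f:
            f.write("new version\n")

        # Removal of the replaced package: names are compared as plain
        # strings, so "/bin/tool" looks unrelated to "/usr/bin/tool".
        files_of_new_package = {"/usr/bin/tool"}
        for name in ["/bin/tool"]:
            if name not in files_of_new_package:
                os.unlink(os.path.join(root, name.lstrip("/")))

        return os.path.exists(new_path)
```

In this model `simulate_file_move()` returns `False`: removing `/bin/tool` deletes the only copy of the file that was just unpacked as `/usr/bin/tool`.
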
The purpose of this DEP is to select and implement a change to `dpkg`
that improves the way it handles such situations as they affect typical
Debian installations.
Naive solution
==============
In theory, `dpkg` could resolve this automatically. For every file it
touches, it could canonicalize the location using the actual filesystem
and check whether any other installed file has the same canonicalized
location. Unfortunately, `dpkg` cannot know which filenames can
collide, so it would check every filename in its database. For
canonicalization, it would `stat()` every component of every filename.
This easily amounts to a million or more `stat()` calls on larger
installations. Caching could reduce the impact somewhat, but since
Debian introduces aliases during maintainer scripts, it would have to
invalidate the cache after maintainer scripts have been run. The
resulting performance would be unacceptable.
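
The cost can be illustrated with a minimal sketch of such filesystem-based canonicalization. It is an assumption about what the naive approach would look like, not code from `dpkg`; the function name `canonicalize_naive`, the `counter` dictionary and the limit of 40 symlink hops are invented for the example. It uses `lstat()` so that symbolic links themselves are detected.

```python
import os
import stat
import tempfile


def canonicalize_naive(root, path, counter):
    """Resolve symlinked directory components of `path` below `root`.

    Every component costs one lstat() call, counted in counter["lstat"].
    The final component is kept as named, since only leading directories
    cause aliasing.
    """
    pending = [c for c in path.split("/") if c]
    resolved = []
    hops = 0
    while pending:
        comp = pending.pop(0)
        if comp == ".":
            continue
        if comp == "..":
            if resolved:
                resolved.pop()
            continue
        on_disk = os.path.join(root, *resolved, comp)
        counter["lstat"] += 1
        try:
            mode = os.lstat(on_disk).st_mode
        except (FileNotFoundError, NotADirectoryError):
            mode = 0  # Not on disk (yet): keep the name unchanged.
        if stat.S_ISLNK(mode) and pending:
            hops += 1
            if hops > 40:
                raise OSError("too many levels of symbolic links")
            target = os.readlink(on_disk)
            if target.startswith("/"):
                resolved = []  # Absolute targets restart at the root.
            pending = [c for c in target.split("/") if c] + pending
        else:
            resolved.append(comp)
    return "/" + "/".join(resolved)


def demo():
    """Canonicalize two aliased names and report the number of lstat() calls."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "usr/bin"))
        os.symlink("usr/bin", os.path.join(root, "bin"))
        open(os.path.join(root, "usr/bin/tool"), "w").close()
        counter = {"lstat": 0}
        a = canonicalize_naive(root, "/bin/tool", counter)
        b = canonicalize_naive(root, "/usr/bin/tool", counter)
        return a, b, counter["lstat"]
```

Here `demo()` maps both names to `/usr/bin/tool` at the price of seven `lstat()` calls for just two filenames. Repeating this for every filename in the database is what makes the approach expensive.
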
Proposal
========
In order to handle aliasing efficiently, `dpkg` gains new options
`--add-alias <symlink>`, `--remove-alias <symlink>` and
`--list-aliases`. When creating symbolic links that cause aliasing
effects, the creating entity is supposed to inform `dpkg` using an
appropriate invocation. Doing so records the aliasing information in a
new mapping inside its administrative directory. No existing
administrative files are modified as a result of this operation. When
`dpkg` operates on paths, it can compute a canonicalized version using a
pure function without the need to `stat()` files on disk, which greatly
improves performance. Canonicalized paths are only needed when
determining whether a file conflict exists. In all other cases,
original paths continue to be used, as symbolic links will be followed by
filesystem operations. The `--add-alias` operation records the target
of the symbolic link that must exist prior to invocation. The
`--remove-alias` operation fails if any files are still installed in the
aliased location.
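
The intended semantics of the three options can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the class `AliasRegistry` and its method names are hypothetical, the mapping is held in memory rather than in the administrative directory, and "files still installed in the aliased location" is read here as "database paths that start with the link name", which is one possible interpretation of the text above.

```python
import os
import posixpath


class AliasRegistry:
    """Hypothetical in-memory model of the proposed alias mapping."""

    def __init__(self, root):
        self.root = root    # Installation root, as with dpkg's --root.
        self.aliases = {}   # Link name -> destination, absolute within root.

    def add_alias(self, link):
        """Model of --add-alias: the symbolic link must already exist."""
        on_disk = os.path.join(self.root, link.lstrip("/"))
        if not os.path.islink(on_disk):
            raise ValueError(f"{link} is not an existing symbolic link")
        target = os.readlink(on_disk)
        # Record the target as an absolute path inside the installation.
        dest = posixpath.normpath(
            posixpath.join(posixpath.dirname(link), target))
        self.aliases[link] = dest

    def remove_alias(self, link, installed_paths):
        """Model of --remove-alias: refuse while files use the link name."""
        # Assumption: "installed in the aliased location" means database
        # paths that are located below the link name.
        users = [p for p in installed_paths if p.startswith(link + "/")]
        if users:
            raise ValueError(f"{link} is still used by {users[0]}")
        del self.aliases[link]

    def list_aliases(self):
        """Model of --list-aliases."""
        return sorted(self.aliases.items())
```

Note that nothing in this model moves files or creates links; it only records what already exists on disk.
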
Rejected proposals
==================
Hardcoding aliases into dpkg
----------------------------
It was suggested to include a static aliasing mapping into the `dpkg`
source code. Since `dpkg` is used by multiple projects in different
ways (not necessarily Debian-derivatives), this approach would break
other consumers. Also note that Debian's `dpkg` can be used to operate
on an installation using different aliases via the `--root` flag. As
such the alias mapping needs to be a property of the installation.
Modifying package lists in place
--------------------------------
`dpkg` could rewrite the extracted `.list` files from `control.tar` and
store paths in canonicalized form. Canonicalization would happen
when a `control.tar` is extracted. It would also happen either as a
one-time conversion during the upgrade of `dpkg` or whenever a `.list`
file is read. Given canonicalized list files, string comparison on
files would support conflict detection. Other pieces to be updated in a
similar way include `alternatives`, `diversions`, `statoverride`, and
`triggers`.
This would affect the output of `dpkg -S`, which would then output
canonicalized paths. Packages generated by `dpkg-repack` would have
their contents canonicalized as well.
Managing the aliasing mapping using a control file
--------------------------------------------------
It was suggested that the mapping could be managed via a special control
file `canonical`. Given that aliasing is not a common operation, the
benefit of handling it declaratively is minor. Beyond that, aliasing
can also happen as a customization issued by an administrator.
Therefore, a command line based approach is preferred.
Having dpkg move files and create symbolic links
------------------------------------------------
When instructed with `--add-alias`, `dpkg` could also create the
corresponding symbolic links and move the affected files to their new
location. While that would be convenient, doing so atomically is
non-trivial. Sometimes, the underlying filesystem does not fully conform
to POSIX (e.g. `overlayfs`) and such corner cases need to be managed
individually. Since such an implementation already exists outside
`dpkg` and its complexity is non-trivial, the moving of files shall
remain external. In case aliases are set up in a bootstrap setting, no
moves are necessary.
Implement aliasing after metadata tracking
------------------------------------------
The [metadata
tracking](https://wiki.debian.org/Teams/Dpkg/Spec/MetadataTracking)
feature enhances `dpkg` with knowledge about filesystem metadata for
installed files. This includes knowledge of symbolic links, which would
help with tracking aliasing. Unfortunately, progress on this is fairly
slow and we think that aliasing support is more urgent.
Proposal internals
==================
A new file `aliases` is added to the administrative directory. Each pair
of lines, containing a link name and a destination, records one alias.
Within this file, no link name or destination may contain another link
name. The `--add-alias` and `--remove-alias` options change this file only
and must ensure that these properties are retained. This leads to a trivial
algorithm for canonicalizing paths. A given path is scanned for
recorded link names occurring as a sub-path, and each match is replaced
with the recorded destination. This process is repeated until a scan pass
performs no substitution. Usually, two scan passes will be sufficient.
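
A minimal sketch of the file format and the canonicalization loop could look as follows. Several points are assumptions rather than part of the draft: the function names are hypothetical, the link name is taken to be the first line of each pair, "contains" is interpreted as "starts with, on a path component boundary" (a plain substring test would reject `/bin` next to `/usr/bin`), a path equal to a link name is left alone because it names the symbolic link itself, and the pass limit is an arbitrary safety net.

```python
def _contains(path, link):
    """True if `link` is a leading run of whole components of `path`."""
    return path == link or path.startswith(link + "/")


def check_invariant(aliases):
    """Reject mappings where a link name or destination contains a link name."""
    for link, dest in aliases.items():
        for other in aliases:
            if other != link and _contains(link, other):
                raise ValueError(f"link {link} contains link {other}")
            if _contains(dest, other):
                raise ValueError(f"destination {dest} contains link {other}")


def parse_aliases(text):
    """Parse pairs of lines (link name, then destination) into a dict."""
    lines = text.splitlines()
    if len(lines) % 2 != 0:
        raise ValueError("aliases file must consist of pairs of lines")
    aliases = dict(zip(lines[0::2], lines[1::2]))
    check_invariant(aliases)
    return aliases


def canonicalize(path, aliases, max_passes=16):
    """Pure function: substitute link names until a pass changes nothing."""
    for _ in range(max_passes):
        for link, dest in aliases.items():
            if path.startswith(link + "/"):
                path = dest + path[len(link):]
                break  # Start the next scan pass on the new path.
        else:
            return path  # This pass performed no substitution.
    raise ValueError("alias mapping does not converge")


def conflicts(path_a, path_b, aliases):
    """Two database paths conflict if they canonicalize to the same name."""
    return canonicalize(path_a, aliases) == canonicalize(path_b, aliases)
```

With the mapping `/bin` to `/usr/bin`, `conflicts("/bin/sh", "/usr/bin/sh", aliases)` is true while `/binary/x` is left untouched, and no `stat()` call is involved. A mapping such as `/bin` to `/usr` together with `/usr/lib64` to `/usr/lib` satisfies the invariant under this interpretation yet needs a second substituting pass for `/bin/lib64/x`, which is why the loop repeats.
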
Much of the internal work has been prototyped by [Simon
Richter](https://salsa.debian.org/sjr/dpkg/-/tree/wip-canonical-paths)
and can be used. It demonstrates how the `fsys_namenode` can be
augmented with a canonicalized path and how `fsys_hash_find_node` can be
extended with a new flag to differentiate between lookups considering
aliases or exact names. It differs from what is proposed here in the
API to configure aliases and in possibly storing partially canonicalized
versions of file names.