
Re: Bug#1004557: man-db: please make index.db installations reproducible



Quoting Colin Watson (2022-01-31 03:28:07)
> > But if that's the wrong approach, lets think of the alternative: making
> > sure that the mtimes of manual pages is reproducible. If I use gdbm_dump on
> > the index.db of two different chroots, then it looks like the following
> > manual pages have differing timestamps:
> > 
> > bash-builtins, which, dash, mawk, pager, awk, sh, more, nawk, builtins
> > 
> > Most of those seem to be symlinks into /etc/alternatives and those symlinks get
> > created by maintainer scripts using update-alternatives. Are you suggesting
> > that update-alternatives should gain support for setting the mtime of the files
> > it creates to SOURCE_DATE_EPOCH?
> 
> I think that would at least be worth considering.  It doesn't seem any
> less obvious a thing to do for reproducible installs than hacking mandb
> would be, and it would deal with the problem closer to its source: for
> instance, it would get you closer to being able to produce
> bitwise-identical reproducible images by e.g. tarring up the filesystem,
> which would preserve filesystem mtimes in the image.  (Though I guess
> --clamp-mtime deals with that, but maybe not all image archiving tools
> have something like that?)

I don't think we will ever be able to do without --clamp-mtime when producing
reproducible chroot tarballs. Dropping it would mean extending every maintainer
script so that it adjusts the mtime of each file it creates or changes. I don't
think that can fly.
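
To make the clamping step concrete, here is a minimal sketch of the same idea
using Python's tarfile module rather than GNU tar. The helper names clamp_mtime
and make_tarball are made up for illustration, and the sketch only deals with
mtimes and member order, not with ownership or other metadata:

```python
import tarfile


def clamp_mtime(epoch):
    """Return a tarfile filter that caps each member's mtime at `epoch`."""
    def _filter(info):
        # Only timestamps newer than the epoch are rewritten; older ones
        # are kept, which is the behaviour --clamp-mtime has in GNU tar.
        if info.mtime > epoch:
            info.mtime = epoch
        return info
    return _filter


def make_tarball(root, out, epoch):
    """Archive the directory `root` into the binary file object `out`."""
    with tarfile.open(fileobj=out, mode="w") as tar:
        # tarfile adds directory entries recursively in sorted order
        # (Python 3.7 and later), so member order is stable as well.
        tar.add(root, arcname=".", filter=clamp_mtime(epoch))
```

Calling make_tarball(chroot_dir, fileobj, int(os.environ["SOURCE_DATE_EPOCH"]))
would be the typical use; the point is that the clamping happens once, at
archive time, instead of in every maintainer script.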

> Another approach might be to modify filesystem timestamps after postinsts
> have finished running but before mandb runs to clamp timestamps to
> SOURCE_DATE_EPOCH; a bit like your proposed patch, but actually modifying the
> filesystem timestamps as well.  I'm not sure where that could go, though.  It
> can't be in mandb because the postinst deliberately doesn't run mandb as
> root; and of course mandb is itself run from a postinst.  Maybe some kind of
> dpkg hook, or maybe it would be simplest to just run a post-processing step
> that clamps all the filesystem timestamps and then runs the equivalent of
> "sudo -u man mandb -cq"?  (This might be more palatable with man-db 2.10.0,
> where this will take more like 10 seconds rather than several minutes; see
> #1003089.)

I don't like the idea of moving functionality like that into chroot-creating
scripts. If we want the chroot to have a certain property, we should add that
to the packages involved using declarative methods.

So another way to fix this would be to add a "touch" call to every maintainer
script calling update-alternatives involving man pages and let them set the
symlink mtime to SOURCE_DATE_EPOCH if that variable is set. But I think that's
a bad idea and we should rather do this centrally.
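
For reference, the per-link operation itself is tiny, wherever it ends up
living. A rough sketch in Python, where pin_symlink_mtime is a hypothetical
helper and not anything update-alternatives provides today:

```python
import os


def pin_symlink_mtime(link, environ=os.environ):
    """Set the timestamps of the symlink itself to SOURCE_DATE_EPOCH.

    Returns True if the timestamps were changed, False if
    SOURCE_DATE_EPOCH is unset and the link was left alone.
    """
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is None:
        return False  # normal operation: behaviour is unchanged
    epoch = int(value)
    # follow_symlinks=False changes the link and not its target, like
    # "touch -h". Note that this sets atime as well as mtime.
    os.utime(link, (epoch, epoch), follow_symlinks=False)
    return True
```

Repeating this in every maintainer script is exactly what I'd like to avoid.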

> > I'm puzzled by bash-builtins though because that one is not a symlink. So I
> > don't understand why the timestamp differs there.
> 
> This puzzled me for a while too, but it's because
> /usr/share/man/man7/builtins.7.gz is a symlink created by
> update-alternatives and references bash-builtins in its NAME, which
> provoked https://bugs.debian.org/691643.  I've now fixed that upstream:
> 
>   https://gitlab.com/cjwatson/man-db/-/commit/37ab864354c1d0ac09e27d2346a1221bf4628509
> 
> This may cause your comparisons to show more differences, but it should
> mean that they're more reliably the *same* differences.  Previously, the
> behaviour depended on directory iteration order (actually usually the
> location of the first physical extent of each file on disk, since mandb sorts
> by that for improved performance on rotational disk drives).

Thanks for the fix!

I talked with Guillem about the possibility of changing update-alternatives to
produce reproducible mtimes. I'm adding debian-dpkg@lists.debian.org to discuss
having a reproducible index.db by changing update-alternatives.

Reading the commit you quote above, it seems that using the symlink's mtime is
on purpose? I think the problem would not exist if the mtime of the link target
were used instead. But there is probably a reason why this is not done already?

Guillem also brought up that using SOURCE_DATE_EPOCH is wrong in this context
because this is about runtime behaviour. I disagree with that assessment. The
idea would be for update-alternatives to check whether SOURCE_DATE_EPOCH is set
and to change its behaviour only if it is. That means that the current
behaviour of update-alternatives would be unchanged without SOURCE_DATE_EPOCH
set. SOURCE_DATE_EPOCH would only be set when building something that needs to
be reproducible, like a chroot tarball or system image. Since building a
chroot tarball or system image is essentially compiling a final artifact from
some other input, I think this is completely in line with the idea that
SOURCE_DATE_EPOCH is there to allow creating reproducible build output. In that
sense, update-alternatives would be in line with many other tools that respect
SOURCE_DATE_EPOCH and thus differentiate between being used in the context of
some build process (here: creating a chroot tarball) and normal operation. I
don't think that it makes a difference that the inputs to the build process
here are binary packages and not sources. During normal package building, build
dependencies do not always provide human-readable source that is then
recompiled either; often they provide just binary material that is then
integrated into the final build output.

Guillem was thinking about introducing a new variable in addition to
SOURCE_DATE_EPOCH to indicate that some software should produce reproducible
output in scenarios like this. This would mean that software that already
supports SOURCE_DATE_EPOCH and is called by maintainer scripts would then have
to be patched to special-case the new variable as well as SOURCE_DATE_EPOCH. I
also don't think a new variable is a good idea because I think that building a
reproducible chroot tarball can be well described as some sort of build process
for which SOURCE_DATE_EPOCH makes perfect sense.

We also thought about letting update-alternatives use the mtime of the symlink
target as the mtime of the symlink. But this would be a bad idea because backup
software will likely not notice a change of the symlink in case the symlink
switches to a target with a lower mtime.
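
To illustrate the backup concern, here is a toy model that assumes a backup
tool which selects entries purely by "mtime newer than the last run". The
timestamps are invented and real backup software may look at more than mtime:

```python
def changed_since(mtimes, last_backup):
    """Return the paths a purely mtime-based incremental backup would copy."""
    return sorted(p for p, m in mtimes.items() if m > last_backup)


# Hypothetical timestamps, in seconds since the epoch.
LAST_BACKUP = 2000
NEW_TARGET_MTIME = 1500  # the alternative the link points to after the switch
SWITCH_TIME = 2500       # when update-alternatives re-creates the link

link = "/usr/share/man/man1/awk.1.gz"

# The link carries the time it was created: the switch is noticed.
own_mtime = changed_since({link: SWITCH_TIME}, LAST_BACKUP)

# The link inherits the older mtime of its new target: the switch is missed.
inherited_mtime = changed_since({link: NEW_TARGET_MTIME}, LAST_BACKUP)
```

With these numbers own_mtime contains the link and inherited_mtime is empty.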

What do you think?

Thanks!

cheers, josch
