Handling of entropy during boot
Hi,
as the getrandom() system call is used more and more, there have been bug
reports about services that use it blocking for a long time at startup and/or
getting killed by systemd because they don't start fast enough [1, 2].
There is a random seed file stored by systemd-random-seed.service that saves
entropy from one boot and loads it again after the next reboot. The random
seed file is rewritten immediately after it is read, so an unclean shutdown
won't cause the same seed to be used again. The
problem is that systemd (and probably /etc/init.d/urandom, too) does not set
the flag that allows the kernel to credit the randomness and so the kernel does
not know about the entropy contained in that file. Systemd upstream argues that
this is supposed to protect against the same OS image being used many times
[3]. (Links to further discussion can be found at [4].)
But an identical OS image needs to be modified anyway in order to be secure
(re-create ssh host keys, change root password, re-create ssl-cert's private
keys, etc.). Injecting some entropy in some way is just another task that
needs to be done for that use case. So basically the current implementation
of systemd-random-seed.service breaks stuff for everyone while not fixing the
thing they are claiming to fix. Also, the breakage will cause people to invent
their own workarounds which will probably create more security issues than
those that are fixed by the systemd behavior. Therefore I think it should be
the default to credit the entropy of the saved random seed when loading it,
and the special needs of identical OS images used many times should be
documented in the release notes.
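To illustrate the difference between loading a seed with and without credit,
here is a minimal Python sketch. It is not what systemd-random-seed does
internally; the function names are made up for illustration, and the ioctl
number is an assumption that holds for x86 and ARM but may differ on other
architectures. Crediting needs root.

```python
import fcntl
import struct

# Value of RNDADDENTROPY from <linux/random.h> on x86 and ARM.
# Assumption: other architectures may encode this ioctl number differently.
RNDADDENTROPY = 0x40085203


def pool_info(seed: bytes, credit_bits: int) -> bytes:
    """Pack the seed as struct rand_pool_info: two C ints, then the data."""
    return struct.pack("ii", credit_bits, len(seed)) + seed


def load_seed(seed: bytes, credit: bool, device: str = "/dev/urandom") -> None:
    """Feed the seed to the kernel, optionally crediting 8 bits per byte."""
    with open(device, "wb", buffering=0) as rng:
        if credit:
            # Mixes the seed in and raises the kernel's entropy estimate.
            # Python copies at most 1024 bytes here, so keep the seed small.
            fcntl.ioctl(rng, RNDADDENTROPY, pool_info(seed, 8 * len(seed)))
        else:
            # Mixes the seed in, but the entropy estimate stays unchanged.
            rng.write(seed)
```

With credit=False the kernel still mixes the data in, but getrandom() callers
keep blocking until the kernel has collected enough entropy by itself.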
A refinement of the random seed handling could be to check if the hostname/
virtual machine-id is the same when saving the seed, and only credit the
entropy if it is unchanged since the last boot.
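A rough sketch of such a check, again in Python. The tag file and all
function names are hypothetical; the idea is just to store a hash of
machine-id and hostname next to the seed and to compare it at load time.

```python
import hashlib
from pathlib import Path


def identity_tag(machine_id: str, hostname: str) -> str:
    """Hash the machine identity so it can be stored next to the seed."""
    data = f"{machine_id.strip()}\n{hostname.strip()}".encode()
    return hashlib.sha256(data).hexdigest()


def save_tag(tag_file: Path, machine_id: str, hostname: str) -> None:
    """Record the current identity whenever the seed file is written."""
    tag_file.write_text(identity_tag(machine_id, hostname))


def may_credit(tag_file: Path, machine_id: str, hostname: str) -> bool:
    """Credit the seed only if the identity matches the one saved with it."""
    try:
        saved = tag_file.read_text().strip()
    except (OSError, ValueError):
        # Missing or unreadable tag: treat the seed as untrusted.
        return False
    return saved == identity_tag(machine_id, hostname)
```

Note that this only helps if the machine-id or hostname really differs
between clones. An image that is copied together with its machine-id would
still pass the check.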
If the random seed file is not present (or the hostname/machine-id check
fails), services may still block for a long time before they start. To keep
systemd from killing them because of timeouts, there should be a oneshot
service that waits for getrandom() to unblock and that other services can use
as a dependency. (This is not necessary with /etc/init.d/urandom because
there are no timeouts.)
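The program run by such a service could be as small as the following Python
sketch (Python 3.6+, Linux 3.17+). The function name is made up, and the unit
that would wrap it is not shown; other units would order themselves After=
that unit.

```python
import os
import time


def wait_for_entropy() -> float:
    """Block until the kernel's random pool is initialized.

    Returns the number of seconds spent waiting.
    """
    start = time.monotonic()
    # With flags=0, getrandom() blocks until the pool has been initialized
    # once and never blocks again afterwards.
    os.getrandom(1)
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"random pool ready after {wait_for_entropy():.1f}s")
```

Services depending on this would then never see getrandom() block themselves,
so their own start timeouts would not be hit because of missing entropy.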
The systemd maintainers argue that individual services should handle this
problem [1,2]. But this does not scale and the whole point of the getrandom()
syscall is that it cannot fail and that its users do not need fallback code
that is not well tested and probably buggy [5].
Cheers,
Stefan
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912087
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914297
[3] https://github.com/systemd/systemd/issues/4271
[4] https://daniel-lange.com/archives/152-Openssh-taking-minutes-to-become-available,-booting-takes-half-an-hour-...-because-your-server-waits-for-a-few-bytes-of-randomness.html
[5] https://lwn.net/Articles/605828/