Re: support for merged /usr in Debian

On 01/03/2016 12:03 PM, Daniel Reurich wrote:
> On 03/01/16 23:18, Ben Hutchings wrote:
>>> Then why is it that since the introduction of systemd is having /usr on
>>> a separate partition suddenly considered evil and systemd complains
>>> loudly about it.  It always has worked and does work fine for me with
>>> sysvinit
>> systemd complains if it has to mount /usr itself.  This is because
>> mounting of local filesystems generally depends on various services and
>> udev hooks that may themselves depend on /usr.  This is also true when
>> using sysvinit.  Some services go through contortions to work before
>> /usr is mounted; others may behave subtly differently if it's a
>> separate filesystem.  We really need a simplified code path for
>> mounting /usr early on, and that is now provided by the initramfs.
> Ah, so it's actually packages that don't separate device configuration
> logic from the application or daemons properly that has caused the
> brokenness.  Can we identify and fix the packages that cause this issue?

Because that doesn't work in practice. Case in point: #777547:

I found this bug nearly a year ago (Jessie was not released yet!) while
testing something completely different (I'm not using that setup, I
just stumbled upon it): you're using sysvinit as init system (or
potentially systemd without initrd, but I haven't tested that), and
have /usr on NFS, but / locally. This is _exactly_ the kind of corner
case that I would have expected from reading this thread that people
are interested in - but apparently that is not the case because not
only do I appear to be the only one to have found this problem (present
since August 2nd, 2014!) but nobody thought this was important enough
to fix, even after the release of Jessie.

And this is just one example. There are lots of other packages that
could in principle be required for mounting /usr in some setups but
aren't installed in the rootfs directly - even in Jessie today.

> Is this also something to do with the inherent lack of determinism and
> parallelization in systemd's startup as well (just out of interest)?

Well, it's not systemd's fault here, but it's not completely unrelated
to parallelization.

If you look at it historically, the earliest systemd versions (up to
v19) assumed that / and /usr could be separate, and the systemd
developers expected that to remain the case. The
warning that /usr is separate was introduced in the commit
80758717a6359cbe6048f43a17c2b53a3ca8c2fa between v19 and v20. The
problem was that setups started to break, especially on other distros
that adopted systemd earlier. Why? Because systemd will start
everything in parallel that isn't explicitly ordered (which I would
argue is a good thing, because it makes boot faster) - but a lot of
ordering on sysvinit systems, especially w.r.t. the presence of /usr,
was implicit, not explicit. Debian was historically generally better at
ensuring explicitness even before the systemd adoption (via proper LSB
headers) as compared to other distributions, but it wasn't perfect
either (far from it).

Of course, at this point, you can argue that you _should_ fix all the
broken packages (systemd itself at that point _wasn't_ broken on split
/usr systems, just the things it executed - and as Debian Jessie shows
systemd itself still works on those types of systems), but:

 - there are a *lot* of packages that are affected by this, because
   you have to consider all the library dependencies of a package. This
   cascades down: say you have software A required in early boot that
   requires library B. Library B in version 42 now has a new dependency
   on library C - library C will now need to be moved to / from /usr;
   even though the maintainer of library C might never have expected
   there to be an early-boot use case for that library. This is a huge
   maintenance burden on a lot of people.

   It's not just filesystems either - anything that installs a udev
   rule is affected; (udev itself is fine w/o /usr mounted) because if
   there's hardware already plugged in that requires udev rule
   processing with things in /usr, this currently either fails
   completely, OR there needs to be some hack that defers the execution
   of this until later in the boot process (OR the package needs to go
   in /).

 - the rootfs will be quite large, making the split from /usr less and
   less attractive: all library dependencies of all filesystems which
   could *possibly* carry /usr need to be in / now - and I've seen FUSE
   filesystems written in C++ and Python. Ok, not all FUSE filesystems
   are useful for /usr, but there is probably some filesystem out there
   that requires either of those dependencies.

   The problem here is that the cost for the larger rootfs is not
   incurred only for those people using weird filesystems for their
   setups, but for everybody else that has software installed that uses
   either C++ or Python (MOST Debian users I presume) because those are
   now in /.

   (In Jessie, if you use D-I to install a system with ONLY the SSH
   server task selected, you'll get 420 MiB usage on /, of which
   162 MiB are kernel modules and 36 MiB /boot, and 224 MiB usage on
   /usr. That means that / even without the kernel modules is nearly
   as large (222 MiB) as /usr (224 MiB) for a minimal Jessie system.)

 - you have to be *really* careful when scripting things in early boot
   before /usr is mounted, because a lot of useful things aren't
   available in /.

   How many people know that awk isn't in / but in /usr? I didn't until
   I actually tried an init script that was supposed to work in early
   boot actually in early boot without /usr mounted; and then I figured
   out that the script broke completely because awk was missing... But
   the script worked perfectly while testing on a running system.
   (Hint: sed is your friend in those cases, because that is in /.)
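
To make the library cascade from the first point above concrete, here is
a minimal sketch in C. The function name, the fixed-size dependency
matrix and the package indices are all hypothetical; this is not taken
from any Debian tooling. It just computes everything that has to live
in / once a single early-boot program needs it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKGS 8

/*
 * Hypothetical model: deps[i][j] != 0 means package i directly depends
 * on package j. Marks pkg and everything it depends on, directly or
 * indirectly, in needed[]. Packages already marked are skipped, so
 * dependency cycles terminate.
 */
static void mark_needed_in_root(int deps[MAX_PKGS][MAX_PKGS], size_t n,
                                size_t pkg, int needed[MAX_PKGS])
{
    assert(n <= MAX_PKGS && pkg < n);

    if (needed[pkg])
        return;
    needed[pkg] = 1;

    for (size_t j = 0; j < n; j++) {
        if (deps[pkg][j])
            mark_needed_in_root(deps, n, j, needed);
    }
}
```

With A = 0, B = 1, C = 2: as long as only A depends on B, C can stay in
/usr. The moment B grows a dependency on C, the same call for A also
marks C, without C's maintainer having changed anything.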

It's not that these issues CAN'T be solved, but these things take up a
lot of manpower and time for a lot of people, and the current state of
things is a clear indication that the current way of doing things is
just not reasonably maintainable.

So what happened on the systemd side at this point? They added a
warning that the current setup might suffer from subtle breakage. But
that isn't a solution in and of itself, that just helped people debug
things. (Note that a lot of subtle breakage already occurs on sysvinit,
people just didn't notice it. systemd just brought the problem more to
light in some more obvious cases, because it breaks assumptions about
implicit ordering of things.)

What the systemd developers recommended as a solution was to use the
initrd. Because it is used for _exactly_ the same thing: a small
filesystem containing the necessary binaries to bring up the rest of
the system. In that case, it was originally meant for the root
filesystem only, but the logic is the same: one can just also mount
/usr in the initrd and everything will work. Also, all these
filesystem types / storage solutions already need to support initramfs
for the root filesystem, and that support is much more likely to be
well-tested than the support for a separate /usr mounted from within /.
And in contrast to the rootfs the initrd is still really small.

So that was the state in February of 2011, when the warning was added
to systemd and the systemd developers recommended the use of the
initrd: mounting /usr from a running system is broken. Either it is
already completely broken (in some cases), or, in all other cases where
it currently works, it is broken maintenance-wise.

Now you can say: well, mounting /usr from a running system works for
me - but you should keep in mind that Debian's current support for this
comes at a HUGE cost for the entire ecosystem, and I haven't seen
anybody keen on fixing all the issues that currently cause breakage,
even if it is just subtle. See the bug I reported at the beginning of
the email: nothing happened in nearly a year and it took me testing an
obscure corner case of some *other*, *unrelated* piece of software to
even stumble upon this bug - and I only did that because I wanted to be
extremely thorough - I doubt other people try so many different things
as I did back when I found that bug. (Even I don't regularly do that.)

Note that this is all independent of the UsrMerge business - this is
just recognizing the fact that the way things have been done before is
not something that will work in the future. And it doesn't really have
to do much with systemd (all bugs w.r.t. this are not in systemd, they
are in other software and/or packaging thereof), it has just been the
messenger for this for the past 5 years. And even if Debian had stuck
with sysvinit (or chosen upstart), it would STILL be a good idea to
have split-/usr premounted in initrd instead of desperately trying to
support something that has been coming apart for at least the last 10
years or so, even if some people haven't noticed it yet.

To provide something more constructive here:

So what stops people from using an initramfs? The only reasonable
arguments I've seen are the following:

 * it's black magic (too complicated)
 * it's too slow (loading it from the boot loader, execution of the
   initrd itself)

So what do you do in the case where (for whatever reason) you have to
have a separate partition for / and /usr, but don't want to use the
existing initrds (initramfs-tools, dracut) because of one or both of
the above reasons?

Well, just for the heck of it I wrote a braindead-simple initrd
implementation in just 300 LOC:


It tries to mount / and /usr (taking root= and x-mount.usr= from the
kernel command line) and then just switches root and exec's init. It
doesn't support much, but its only requirements are that devtmpfs
is compiled into the kernel, as well as all things needed to mount the
rootfs and /usr, because it doesn't load any modules. devtmpfs is
required because we don't want to be in the business of creating device
nodes ourselves (and listening to netlink for these kinds of events is
awful and could possibly conflict with udev).
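
To give an idea of how little is involved, here is a minimal sketch of
the command-line lookup, assuming plain space-separated key=value
parameters without quoting. The helper name cmdline_get is hypothetical
and this is not the code from the PoC itself. It also returns the first
match, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find "key=value" in a kernel command line and
 * copy the value into out. Returns 1 if the key was found and the value
 * fits into out, 0 otherwise. Quoted values are not handled.
 */
static int cmdline_get(const char *cmdline, const char *key,
                       char *out, size_t outlen)
{
    size_t keylen = strlen(key);
    const char *p = cmdline;

    assert(out != NULL && outlen > 0);

    while (*p != '\0') {
        size_t toklen;

        /* skip whitespace between parameters */
        p += strspn(p, " \t\n");
        toklen = strcspn(p, " \t\n");

        /* the parameter must be exactly "<key>=<value>" */
        if (toklen > keylen && strncmp(p, key, keylen) == 0 &&
            p[keylen] == '=') {
            size_t vallen = toklen - keylen - 1;

            if (vallen >= outlen)
                return 0;   /* value too long for the buffer */
            memcpy(out, p + keylen + 1, vallen);
            out[vallen] = '\0';
            return 1;
        }
        p += toklen;
    }
    return 0;
}
```

The two values (for root= and x-mount.usr=) would then be handed to
mount(2) as the source device. Note that the bare "ro" parameter is not
mistaken for "root", because a match requires the "=" directly after
the key.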

I tried this in a Jessie VM with a custom kernel (virtio-blk and ext4
compiled in, otherwise identical to the Jessie kernel) and it
successfully boots an otherwise unchanged Jessie installation.

This is just a proof of concept, but it has the following properties:

 - compressed initrd is just 6742 bytes (!) in size; this is much
   smaller than ANY kernel image and also far smaller than a typical
   initrd (which is around 1000x larger). Unless you are using network boot,
   where you incur additional latency for additional files, this should
   be *really* fast to load from a bootloader. (I haven't measured it

 - because it's only a binary and three directories (see the top of the
   source file on how to create an initrd out of it), it doesn't need
   to be regenerated for different kernel versions (it doesn't load
   modules) - no penalty if you install new kernels, just a simple cp
   of a tiny file (smaller than this email).

 - because it's extremely simple plain old C code (no shell) and
   doesn't do anything fancy, it should be *really* fast. I did a
   small test with my VM, rebooting it 111 times per configuration. With no
   initrd, I got 0.92s boot time (systemd-analyze total time) with a
   stddev of 0.10s. With Debian's default initrd I got 1.21s with 0.28s
   stddev. And with my PoC initrd I got 0.89s with 0.10s stddev. (This
   doesn't include the time the boot loader took to load it.) So my PoC
   appears to be roughly of the same speed as not booting with initrd
   at all and letting systemd mount /usr.

Note that it's absolutely brain-dead and doesn't support fancy
root=UUID=... or the like - you really need to specify the kernel's
device node (symlinks like /dev/disk/by-*/... also won't work, as
they are created by udev). The general idea is that for systems that
have a separate /usr that is still reachable for just the kernel, it
could provide a migration path from initramfs-less systems without the
need for a full initrd. Obviously this will only work for the simplest
of use cases, but may be sufficient for the people complaining here?

My question would be: would those people here who have separate /usr
and aren't using initrd be willing to put up with something like that?

If no additional features will be added to this - other than maybe
parsing /etc/fstab of the system to figure out the /usr mount (instead
of requiring an additional kernel option), and improving error
reporting maybe a bit - would you be willing to use something like
that? Covering the use case for a separate /usr partition for when
you'd previously not use an initrd at all for booting.
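
For the /etc/fstab idea, a rough sketch of what the lookup could look
like, assuming glibc's getmntent(3) interface is acceptable inside the
initrd. The helper name fstab_find_device is hypothetical, and none of
this exists in the PoC yet.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up the device for a mount point (such as
 * "/usr") in an fstab-style file. Returns 1 and fills dev on success,
 * 0 if the file cannot be read, there is no such entry, or the device
 * name does not fit into dev.
 */
static int fstab_find_device(const char *fstab_path, const char *mount_point,
                             char *dev, size_t devlen)
{
    FILE *fp;
    struct mntent *ent;
    int found = 0;

    assert(dev != NULL && devlen > 0);

    fp = setmntent(fstab_path, "r");
    if (fp == NULL)
        return 0;

    /* getmntent() skips comments and returns one parsed entry per call */
    while ((ent = getmntent(fp)) != NULL) {
        if (strcmp(ent->mnt_dir, mount_point) == 0 &&
            strlen(ent->mnt_fsname) < devlen) {
            strcpy(dev, ent->mnt_fsname);
            found = 1;
            break;
        }
    }

    endmntent(fp);
    return found;
}
```

The same restriction as for the kernel command line would apply: an
fstab entry using UUID=... or LABEL=... would come back as that literal
string, which is of no use without udev, so only plain device nodes
would work.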

(If /usr is on the same partition as / initrds are obviously still

If people would say "yeah, ok, I'd be fine with something like that"
and be fine with using this initrd and finally put the question about
mounting /usr there to rest, then I'd be willing to put in some work
and package it for Debian.

