Which package is responsible for setting rlimits?

To: debian-devel@lists.debian.org
Cc: pam@packages.debian.org, systemd@packages.debian.org, debian-kernel@lists.debian.org
Subject: Which package is responsible for setting rlimits?
From: Simon McVittie <smcv@debian.org>
Date: Mon, 1 Feb 2021 13:58:57 +0000
Message-id: <YBgJIUVs/IeDgUYR@momentum.pseudorandom.co.uk>

A recent regression in gnome-keyring (perhaps only on systems that
use dbus-x11; it isn't completely clear to me yet) has prompted me to
look at how rlimits work in Debian. It isn't clear to me which package
is or should be responsible for choosing what arbitrary limits we use
in practice.

The kernel has some defaults, which it sets on pid 1. Some are hard-coded,
but increasingly many seem to be dependent on system state (for example
limiting memory sizes to a fixed fraction of system RAM). Traditionally,
when the init system was extremely minimal and delegated the majority
of its responsibilities to child processes (sysvinit or similar),
these defaults would be inherited by pid 1's children, and recursively
inherited by user processes.
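
To illustrate the inheritance described above, here is a minimal sketch
(the helper name limits_seen_by_child is made up for this example) that
starts a child process and compares the limit it reports with our own.
It assumes a Python interpreter is available as sys.executable and uses
RLIMIT_NOFILE only because it exists on every POSIX platform.

```python
import resource
import subprocess
import sys


def limits_seen_by_child(which=resource.RLIMIT_NOFILE):
    """Return the (soft, hard) pair that a freshly started child reports."""
    # The child is a separate process: whatever it prints is what it
    # inherited across fork() and exec(), not what we computed here.
    code = "import resource; print(*resource.getrlimit(%d))" % which
    result = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    soft, hard = (int(field) for field in result.stdout.split())
    return soft, hard


print("parent:", resource.getrlimit(resource.RLIMIT_NOFILE))
print("child: ", limits_seen_by_child())
```

Unless something in between changes the limits, both lines show the
same pair, which is why limits set on pid 1 end up in user processes.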

In principle, the pam_limits.so module sets rlimits for user
processes. However, by default it is unconfigured, and in the
absence of configuration it needs to default to *something* - either
inheriting from its parent process, or resetting the limits to something
predictable. Inheriting from its parent process is problematic because the
parent process might have reset its limits internally, and in a sysvinit
world it might have been restarted by a sysadmin in an arbitrary execution
environment, leading to unpredictable limits in user processes; but
resetting the limits is also problematic, because it results in PAM
having to second-guess the limits coming from the kernel, which presumably
knows better.

Debian's PAM package currently carries a non-upstream patch to
screen-scrape the rlimits of pid 1 and use them as a guess at what the
kernel's defaults must have been. This makes perfect sense in a sysvinit
world, where sysvinit hardly does anything (the real work of booting the
system is all delegated to sysv-rc) and therefore is unlikely to need
to raise its rlimits; but it doesn't really make sense under systemd,
where pid 1 does a significant amount, and raises its rlimits accordingly.
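
For readers who have not seen what "screen-scraping" pid 1's limits
involves: on Linux the limits of a process are exposed as a text table
in /proc/<pid>/limits. The following is only a rough sketch of parsing
that table, not the code from Debian's PAM patch; the function name
parse_proc_limits and the choice to represent "unlimited" as None are
assumptions made for this example.

```python
import re


def parse_proc_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}.

    A value of "unlimited" is returned as None.
    """
    def to_value(field):
        return None if field == "unlimited" else int(field)

    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are padded with spaces; limit names contain only
        # single spaces, so two or more spaces separate the columns.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        name, soft, hard = fields[0], fields[1], fields[2]
        limits[name] = (to_value(soft), to_value(hard))
    return limits
```

Feeding it the contents of /proc/1/limits gives pid 1's current limits,
which under systemd are the limits pid 1 has raised for itself rather
than the kernel's defaults.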

systemd *also* has configurable default limits to be passed down to
system services (see DefaultLimitMEMLOCK, etc. in /etc/systemd/system.conf).

How is this meant to work, and is it working as intended in practice?
If I'm understanding correctly, upstream it's meant to go something
like this, with more-indented components selectively overriding
less-indented components:

    kernel ->
        (kernel defaults)
            init ->
                (systemd's configuration, if using systemd)
                    system service providing an entry point ->
                        PAM stack, pam_limits.so ->
                            (pam_limits configuration, if used)
                                user sessions

but because sysadmins of sysvinit systems are expected to run
"service foo restart" in an unknown execution environment,
our patched PAM changes this to:

    kernel ->
        (kernel defaults)
            init ->
                (systemd's configuration, if using systemd)
                    system service providing an entry point ->
                        PAM stack, pam_limits.so ->
                            (PAM's best guess at what the limits *should
                            have been*)
                                (pam_limits configuration, if used)
                                    user sessions
                    system service providing an entry point ->
                        ... sysadmin's arbitrary login session... ->
                            system service restarted by sysadmin ->
                                PAM stack, pam_limits.so ->
                                    (PAM's best guess at what the limits
                                    *should have been*)
                                        (pam_limits configuration, if used)
                                            user sessions

I wonder whether the solution ought to involve something like this:

* On non-systemd-booted systems, PAM continues to screen-scrape limits
  from pid 1 for compatibility with the "service foo restart" use-case;
* On systemd systems, PAM stops doing that, and inherits from the parent
  process by default, resulting in user processes getting the limits
  configured in pam_limits (if set), or if not set there, then the limits
  from systemd system.conf (if set), or if not set there either, the limits
  from the kernel.
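
The proposal in the two points above could be sketched roughly as
follows. This is an illustration of the precedence only, not a patch:
the function names, the dictionary keys and the idea of passing limits
around as dictionaries are all invented for the example. The one real
detail is the /run/systemd/system check, which is the test that
sd_booted(3) documents.

```python
import os


def booted_with_systemd():
    """Same test as sd_booted(3): this directory exists only under systemd."""
    return os.path.isdir("/run/systemd/system")


def baseline_limits(inherited, pid1_limits, systemd_booted):
    """Choose what PAM starts from before applying its own configuration."""
    if systemd_booted:
        # Trust the parent: kernel defaults, overridden by system.conf.
        return dict(inherited)
    # sysvinit and similar: the parent may have been restarted from an
    # arbitrary shell, so fall back to what pid 1 has.
    return dict(pid1_limits)


def session_limits(inherited, pid1_limits, pam_config, systemd_booted):
    """Apply pam_limits configuration (if any) on top of the baseline."""
    limits = baseline_limits(inherited, pid1_limits, systemd_booted)
    limits.update(pam_config)  # explicit pam_limits entries always win
    return limits
```

With an empty pam_config, a systemd system ends up with exactly what
the login service inherited, and a non-systemd system ends up with
pid 1's limits, as it does today.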

Rationale: on sysvinit or runit systems, pid 1 is very simple and is
unlikely to need to elevate any limits, but sysadmins are expected
to restart system services in an unpredictable execution environment
(certainly true for sysvinit; I'm not so sure about runit). On systemd
systems, pid 1 is more complex, but part of the value we get for that
complexity is that even when sysadmins restart system services, the
service receives a known and predictable execution environment, so it
does not need to be robust against inheriting a wrong rlimit or other
parameters.

See also #917374, #976373, #923312.

The reason I ask about this is that I want to make sure we are setting
rlimits, and in particular RLIMIT_MEMLOCK, to a realistic value for 2021.
The wider context here is that gnome-keyring-daemon, GNOME's implementation
of the org.freedesktop.Secrets interface, is currently setcap
cap_ipc_lock=ep so that it can mlock(2) secrets and stop them from getting
swapped out. This is ineffective on systems that hibernate, because at
that point everything (even locked memory) has to be written to swap in
any case, but it's better than nothing.
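
As a rough illustration of what locking memory means for an
unprivileged process, the sketch below calls mlock(2) through ctypes on
a 4096-byte buffer. It is not how gnome-keyring does it (that is C code,
and Python makes no promise about copies of the data elsewhere); the
helper name try_mlock and the buffer size are arbitrary, and it assumes
a platform where ctypes.CDLL(None) exposes the C library's symbols, as
it does on Linux.

```python
import ctypes
import errno


def try_mlock(buf):
    """Try to lock a ctypes buffer into RAM with mlock(2).

    Returns (True, None) on success, or (False, errno name) on failure.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # the already-loaded libc
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    if libc.mlock(ctypes.addressof(buf), ctypes.sizeof(buf)) != 0:
        err = ctypes.get_errno()
        return False, errno.errorcode.get(err, str(err))
    return True, None


secret = ctypes.create_string_buffer(4096)  # stand-in for a secret
ok, why = try_mlock(secret)
print("locked" if ok else "mlock failed: " + why)
```

Without CAP_IPC_LOCK, the call succeeds only while the total locked
memory of the process stays within its RLIMIT_MEMLOCK soft limit, which
is why the value of that limit matters here.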

This filesystem capability results in gnome-keyring-daemon having elevated
privileges (even though the privilege is a relatively minor one), which
in principle means it should not trust the execution environment inherited
from a less-privileged caller that might be trying to trick it into
executing attacker-provided arbitrary code with the elevated privilege.
Recent security-hardening changes in GLib made it distrust most environment
variables to reduce the number of foot-shooting incidents (although authors
of setuid or privileged components should note that GLib's maintainers
still consider it to be the setuid program's responsibility to sanitize
its own execution environment, since it is the setuid program that is
setting up an unusual trust relationship).

However, in order to work as designed, gnome-keyring *has to* be able to
trust environment variables that it inherits: either
DBUS_SESSION_BUS_ADDRESS, or XDG_RUNTIME_DIR, or both. Otherwise, it
cannot connect to the D-Bus session bus and provide its intended
functionality. (#981420, #981555)
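
To make the dependency on those two variables concrete, here is a
simplified sketch of how a D-Bus client typically locates the session
bus. It is an approximation under stated assumptions, not the code of
any particular library: real implementations also escape the path,
check that the socket exists, and may have further fallbacks. The
function name session_bus_address is made up for the example.

```python
def session_bus_address(env):
    """Return a session bus address derived from env, or None."""
    address = env.get("DBUS_SESSION_BUS_ADDRESS")
    if address:
        return address  # an explicit address always wins
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if runtime_dir:
        # Conventional location of the per-user bus socket.
        return "unix:path=" + runtime_dir + "/bus"
    return None  # nothing to go on without trusting the environment
```

For example, session_bus_address({"XDG_RUNTIME_DIR": "/run/user/1000"})
gives "unix:path=/run/user/1000/bus", and an environment with neither
variable gives None.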

Historically, gnome-keyring's RAM also contained GPG keys (although it
now delegates GPG key handling to GnuPG's gpg-agent and does not ever see
a decrypted GPG key itself), and also contained SSH keys (although if I
understand correctly, it now delegates *those* to OpenSSH's ssh-agent,
and does not ever see a decrypted SSH key itself). Those are obviously
also desirable to lock into RAM, although I note that neither gpg-agent
nor ssh-agent has CAP_IPC_LOCK, so presumably they find the default
RLIMIT_MEMLOCK sufficient for their needs.

Empirically, it seems that user processes on a Debian 11 system booted
with systemd have an RLIMIT_MEMLOCK of 1/8 of RAM (which is definitely
plenty, but perhaps too much: #976373); user processes on a Debian 11
system booted with sysvinit have an RLIMIT_MEMLOCK of 64K (perhaps not
enough); and user processes on a Debian 10 system booted with systemd
had a fixed RLIMIT_MEMLOCK of 64M (perhaps more like the right value).
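
Anyone who wants to compare their own system with these numbers can
check what a user process actually receives. The sketch below prints
the current process's RLIMIT_MEMLOCK in readable units; the helper name
describe is made up, and it assumes a platform where Python's resource
module provides RLIMIT_MEMLOCK (Linux does).

```python
import resource


def describe(value):
    """Format an rlimit value given in bytes for humans."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    for unit, size in (("GiB", 1 << 30), ("MiB", 1 << 20), ("KiB", 1 << 10)):
        # Only use a unit when the value is an exact multiple of it.
        if value >= size and value % size == 0:
            return "%d %s" % (value // size, unit)
    return "%d bytes" % value


soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("RLIMIT_MEMLOCK soft:", describe(soft), "hard:", describe(hard))
```

Run from a normal login session, this shows the value that has already
been through the kernel, init and PAM.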

I would like to have gnome-keyring inherit some realistic
RLIMIT_MEMLOCK that is enough for it to lock passwords into RAM
without special privileges, without either having to rely on some
lowest-common-denominator behaviour or being a denial-of-service
vector. If that only happens under systemd, so be it - we can document
in README.Debian that if sysvinit users want to lock passwords into RAM,
they will have to configure pam_limits themselves - but I would hope that
given the number of sysvinit advocates in the project, there is someone
who can implement reasonable behaviour for sysvinit systems too.

We have seen a similar mess in the past with arbitrary file descriptor
limits, where nobody is quite sure which component is responsible for
setting the limit to be realistic for 2010s/2020s hardware that can
certainly cope with managing more than 4K file descriptors at a time.

    smcv
