
Bug#904558: What should happen when maintscripts fail to restart a service



Attempting to summarize what was said on this topic in the thread so
far, and at the last technical committee meeting:

It's perhaps important to note that we are not discussing ideal situations
here: any time this conversation becomes relevant, something is already
wrong. We're aiming to recommend the lesser evil, rather than something
actually desirable.

One of the points of view here is Ian and Wouter's assertion that
whenever a service fails to restart in a maintainer script, the most
important thing is to make sure the sysadmin pays attention and fixes
it before proceeding.

Julien Cristau made another point in support of "failure to restart
implies failure to configure" on IRC, namely that the only straightforward
thing for an automated upgrade to do is to look at the successful or
failed exit status of the package manager (whether that means dpkg,
apt, unattended-upgrades or whatever), and assume that exiting 0 means
everything is fine and exiting nonzero means attention is required.
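
As a rough sketch of that constraint, an automated upgrade wrapper might
look something like the following. The function name run_upgrade and the
UPGRADE_CMD variable are hypothetical, introduced here only so that the
command can be swapped out; the point is that the exit status is the only
signal the wrapper has to act on.

```shell
#!/bin/sh
# Hypothetical automation wrapper: the exit status of the package
# manager is the only thing it can reasonably inspect.
# UPGRADE_CMD is an illustration-only override (simple commands only,
# since it is expanded with plain word splitting).

run_upgrade() {
    if ${UPGRADE_CMD:-apt-get -y dist-upgrade}; then
        # Exit 0: assume everything is fine.
        echo "upgrade ok"
        return 0
    else
        # Nonzero: assume a human needs to look at this machine.
        status=$?
        echo "upgrade failed with status $status: attention required" >&2
        return "$status"
    fi
}
```

Anything a maintainer script wants such a wrapper to notice therefore
has to end up reflected in that exit status one way or another.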

At the opposite extreme, Marga's team manages thousands of desktops,
and having to do *anything* manual to any significant number of them
doesn't scale. We can think of inexperienced users' desktops as a bit
like this scenario too, except that instead of having a professional
sysadmin, they have to ask volunteers for help through channels like
debian-user and #debian (and those volunteers' help doesn't really scale
well either). It's also undesirable if the mechanism we use to escalate
the failure to the user is one that itself makes it harder to diagnose or
fix the problem, and in particular there's a concern that when packages
fail to configure, that can make it harder to use apt to install the
necessary tools to diagnose what has gone wrong; Stuart points out that in
his experience of helping people in #debian, this is a practical problem.

Ian considers it to be a design flaw in apt that the actions the user
can take while a package is unconfigured are so constrained; however,
we work with the tools we have, not the tools we'd like to have.

We seem to have consensus among the technical committee that it is at
least occasionally appropriate for failure to restart to cause failure
to configure, although this might be the exception rather than the
rule. The examples given where the error path is most important were
packages that provide a system-level API to other packages, so their
failures are likely to cause other packages to fail to configure (such
as local DNS caches and authentication services like LDAP); and packages
that provide remote access, so their failures need to be fixed before a
potentially remote sysadmin logs out to prevent the sysadmin from being
locked out longer-term (like sshd).

I'm not sure whether we have a concrete example yet of packages at the
opposite extreme, those for which a successful restart matters least. I'd
like to propose the game servers that I maintain, like openarena-server,
as a concrete example here: I hope we can agree that inability to capture
the flag does not justify getting the package management system into a
problematic state? :-) (I think this is currently a bug in those packages,
but I'm not going to fix it until we have consensus here.)

There's a general feeling among the technical committee that a package
failing to configure is far from a user-friendly way to signal errors:
Phil's memorable analogy was that it's like telling a car driver that they
are low on fuel by having the wheels fall off. Historically, we had few
other ways to manage service failures, and perhaps when all you have is
a hammer, everything looks like the Failed-Config state; but in a default
Debian installation we now have a service manager that monitors the state
of all services at all times (not just when they happen to be upgraded)
and collects their stderr at all times (not just writing it to the console
during boot, or to dpkg's stderr during upgrades). Even before we considered
non-sysv init systems, monitoring systems like Nagios were available.

It's perhaps also worth noting that most services, if they fail during
boot rather than during upgrade, don't cause a drastic reaction.
Historically, initscripts would (attempt to) carry on regardless from
just about any failure mode, including failure of services that ought to
be considered critical-path. With systemd as default, our default init
system does have a more dramatic response to certain failures (going
to an emergency-mode shell), but it only does that for a very limited
subset of services (fsck and mount on required filesystems, according to
the man page).

As Anthony points out, we could benefit from there being a way
for packages to report "something is wrong, but carry on anyway":
continuing to get the system into the least-degraded state possible,
but then arranging for dpkg/apt to exit with a nonzero status so that
automated systems can detect that something is not right. However,
this mechanism does not currently exist. One possible implementation
for the default init system might be an apt Dpkg::Post-Invoke hook that
runs `systemctl is-system-running` and, if the result is not success,
`systemctl list-units --failed`. An init-system-agnostic implementation
would require some other convention for maintainer scripts to signal
partial success (or non-fatal failure, depending how you look at it)
to apt/dpkg.
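
To make the systemd-specific idea more concrete, here is a minimal sketch
of what such a hook could run. The function name check_system_state, the
script path and the apt.conf.d snippet in the comments are assumptions
for illustration, not an existing mechanism; the sketch also assumes that
apt treats a failing Post-Invoke command as an error, and it would
misreport on systems where systemd is not PID 1 (chroots, other init
systems).

```shell
#!/bin/sh
# Hypothetical hook, e.g. installed as /usr/local/sbin/apt-check-degraded
# and referenced from an apt.conf.d snippet along the lines of:
#   DPkg::Post-Invoke { "/usr/local/sbin/apt-check-degraded"; };

check_system_state() {
    # is-system-running exits 0 only when the overall state is "running".
    if systemctl is-system-running >/dev/null 2>&1; then
        return 0
    fi
    # Otherwise show which units failed, then signal the problem to the
    # caller through a nonzero status.
    echo "System is not fully running; failed units:" >&2
    systemctl list-units --failed >&2 || true
    return 1
}

# A real hook script would end by calling: check_system_state
```

Note that this reports any failed unit on the system, not only units
touched by the upgrade, so it is closer to general monitoring than to a
per-package signal.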

During the technical committee IRC meeting, we considered whether the
recommendation to "set -e" in maintainer scripts was consistent with
considering a maintainer script failing to be a Very Bad Thing. We
concluded that even if we want to disregard most or all failed service
restarts, it is still good to "set -e", because if something does go wrong
(for instance a typo in the maintainer script, a system that is already
seriously broken, or some other unforeseen circumstance), we want the
maintainer script to fail safe: stop what it's doing, rather than carry
on regardless. If a particular failure is something we can reasonably
predict, reason about and tolerate (as we are arguing failure to restart
a service is, at least sometimes) then someone should make a conscious
decision to add "|| true" (or preferably
"|| some-failure-reporting-mechanism") to that command.
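
A minimal sketch of that pattern, as a fragment of a maintainer script,
might look like this. The helper names report_restart_failure,
restart_service and configure_step, and the RESTART_CMD override, are
all hypothetical and exist only to keep the example self-contained; no
standard failure-reporting mechanism exists today.

```shell
#!/bin/sh
set -e

# Hypothetical reporting helper; stands in for
# "some-failure-reporting-mechanism".
report_restart_failure() {
    echo "warning: could not restart $1; continuing" >&2
}

# RESTART_CMD is an illustration-only override. A real script would
# typically call invoke-rc.d (or let debhelper generate the call).
restart_service() {
    ${RESTART_CMD:-invoke-rc.d} "$1" restart
}

configure_step() {
    # A failure we can predict and have decided to tolerate: handled
    # explicitly, so set -e does not abort here.
    restart_service "$1" || report_restart_failure "$1"
    # Any unforeseen failure in later commands still aborts the script.
    echo "configured $1"
}
```

The "||" makes the decision to tolerate the failure visible at the exact
command it applies to, while everything else in the script keeps the
fail-safe behaviour of "set -e".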

Finally, here are the debhelper mechanisms that most packages use to
manage their services, which I think represent the status quo:

* dh_installinit: defaults to "failure to (re)start is failure to
  configure", but can be overridden with --error-handler; some packages
  set the error handler to "true" (e.g. apache2, isc-dhcp) or to a custom
  shell function (e.g. krb5, samba).
  This is used for LSB init scripts, and for systemd units that have a
  corresponding LSB init script.

* dh_systemd_start: unconditionally uses "|| true".
  This is only used for systemd units that *do not* have a corresponding
  LSB init script. A dh_installinit-style --error-handler would probably
  be a reasonable feature request.

    smcv

