Bug#904558: What should happen when maintscripts fail to restart a service

To: Simon McVittie <smcv@debian.org>, 904558@bugs.debian.org
Cc: Margarita Manterola <marga@debian.org>, Sean Whitton <spwhitton@spwhitton.name>, Ian Jackson <ijackson@chiark.greenend.org.uk>, Tollef Fog Heen <tfheen@err.no>, Anthony DeRobertis <anthony@derobert.net>, Gunnar Wolf <gwolf@debian.org>, Stuart Prescott <stuart@debian.org>
Subject: Bug#904558: What should happen when maintscripts fail to restart a service
From: Wouter Verhelst <wouter@debian.org>
Date: Tue, 9 Oct 2018 10:52:15 +0200
Message-id: <20181009085215.GB2825@grep.be>
Reply-to: Wouter Verhelst <wouter@debian.org>, 904558@bugs.debian.org
In-reply-to: <20181007104909.GA15664@espresso.pseudorandom.co.uk>
References: <877elkdkkg.fsf@silentflame.com> <20181007104909.GA15664@espresso.pseudorandom.co.uk> <877elkdkkg.fsf@silentflame.com>

Hi Simon,

Thanks for your summary.

On Sun, Oct 07, 2018 at 11:49:09AM +0100, Simon McVittie wrote:
> Attempting to summarize what was said on this topic in the thread so
> far, and at the last technical committee meeting:
> 
> It's perhaps important to note that we are not discussing ideal situations
> here: any time this conversation becomes relevant, something is already
> wrong. We're aiming to recommend the lesser evil, rather than something
> actually desirable.
> 
> One of the points of view here is Ian and Wouter's assertion that
> whenever a service fails to restart in a maintainer script, the most
> important thing is to make sure the sysadmin pays attention and fixes
> it before proceeding.
> 
> Julien Cristau made another point in support of "failure to restart
> implies failure to configure" on IRC, namely that the only straightforward
> thing for an automated upgrade to do is to look at the successful or
> failed exit status of the package manager (whether that means dpkg,
> apt, unattended-upgrades or whatever), and assume that exiting 0 means
> everything is fine and exiting nonzero means attention is required.

I think this is the core of the issue: it is incorrect to state that
when a service restart was unsuccessful, everything was fine.
There was a problem. We currently don't have a way to distinguish
between "there was a terrible problem and the sky is going to fall" and
"there was a problem but you might want to ignore it", so technically the
only correct thing to do is to exit with a nonzero exit state,
signalling a problem. Put otherwise, I think that if the following
preconditions are true:

1. The service was running before the package upgrade
2. The package's postinst wants to restart the daemon
3. After the package upgrade, the service fails to start again

Then that means the package upgrade broke something, and the system
administrator should be informed of that fact. We currently have only
one *certain* avenue to inform the system administrator, and that is
through producing a nonzero exit state from apt. A debconf error or
message to stdout or stderr would work too in some cases, but the first
is not always shown and the second might scroll by too fast to be
noticeable, so it is not a certain way to tell the system administrator.
As such, exiting nonzero is the only avenue open to maintainers to do
the right thing.
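
A minimal sketch of the rule above, written as a POSIX shell function.
The helper name `restart_or_fail`, its "yes"/"no" argument and the
injected restart command are hypothetical and only serve the
illustration; a real postinst would normally go through `invoke-rc.d`,
usually via debhelper-generated code. Treating "was not running before"
as a non-error is an assumption of this sketch, not something the
argument above settles.

```shell
# Hypothetical helper illustrating the three preconditions above.
# $1:    "yes" if the service was running before the upgrade, else "no"
# $2...: command that restarts the service
restart_or_fail() {
    was_running=$1
    shift
    if "$@"; then
        return 0    # restart worked: nothing to report
    fi
    if [ "$was_running" = "yes" ]; then
        echo "service was running before the upgrade and failed to restart" >&2
        return 1    # propagate, so that the package fails to configure
    fi
    return 0        # assumption: a service that was already down is not a regression
}
```

Called as `restart_or_fail yes invoke-rc.d SERVICE restart` under
"set -e", a failed restart would then end the postinst with a nonzero
exit state.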

Having said all that...

> At the opposite extreme, Marga's team manages thousands of desktops,
> and having to do *anything* manual to any significant number of them
> doesn't scale. We can think of inexperienced users' desktops as a bit
> like this scenario too, except that instead of having a professional
> sysadmin, they have to ask volunteers for help through channels like
> debian-user and #debian (and those volunteers' help doesn't really scale
> well either). It's also undesirable if the mechanism we use to escalate
> the failure to the user is one that itself makes it harder to diagnose or
> fix the problem, and in particular there's a concern that when packages
> fail to configure, that can make it harder to use apt to install the
> necessary tools to diagnose what has gone wrong; Stuart points out that in
> his experience of helping people in #debian, this is a practical problem.

It is true that there is a larger picture, and that in some
environments, breaking all future upgrades is way more problematic than
not restarting a service once. This is arguably a bug in apt though, and
it feels wrong to me to "fix" such an issue by introducing what is
essentially a workaround in multiple unrelated places; if the problem
then gets fixed properly, we would have to go around the whole system
to undo the workarounds again, which would be a sad state of affairs.

I can think of some alternatives that could be done and that would work
towards a resolution (rather than a workaround) for this problem:

- The policy-rc.d interface could be extended to allow it to signal a
  "restart, but do not fail on error" kind of policy. This would work
  for the "we have thousands of desktops and don't care about a service
  failing to restart" kind of environment.
- Apt could be fixed so that when a package fails to configure, it would
  still be impossible to install and/or configure reverse-dependencies
  of the failing package, but not of packages that are unrelated. This
  would help the "users asking in our support channels can't install
  diagnostic tools to investigate" kind of situation.
- A new state could be created in dpkg to signal "configuration failed,
  but package will work for dependencies". When this is the case, apt
  should inform the user that configuration of some package failed and
  that they might want to investigate, but should not refuse to install
  and/or configure other packages, even reverse dependencies of the
  failing package. This feels right, but I can't come up with a good
  example of the kind of situation which this would fix; perhaps that's
  not a good sign.
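
To make the first alternative a bit more concrete, here is a rough
sketch of how a "restart, but do not fail on error" answer from the
policy layer could be honoured. Exit codes 0 (action allowed) and 101
(action forbidden by policy) exist in today's policy-rc.d interface;
the code 110 and the helper `restart_with_policy` are purely
hypothetical and not part of any existing tool.

```shell
POLICY_FORBID=101                 # existing: action forbidden by policy
POLICY_ALLOW_IGNORE_FAILURE=110   # HYPOTHETICAL: restart, but tolerate failure

# $1:    exit code returned by the policy layer
# $2...: command that restarts the service
restart_with_policy() {
    policy=$1
    shift
    case "$policy" in
        "$POLICY_FORBID")
            return 0 ;;     # policy says: do not touch the service
        "$POLICY_ALLOW_IGNORE_FAILURE")
            "$@" || echo "restart failed, ignored by local policy" >&2
            return 0 ;;
        *)
            # Simplification: every other code is treated as "allowed".
            # A failed restart is then an error, as it is by default today.
            "$@" ;;
    esac
}
```

The point of the design is that the default stays "fail loudly", and
only a site that explicitly installs such a policy gets the lenient
behaviour.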

Some of these will require more work than others; but "requires more
work" by itself has never been a good enough reason not to do something
in Debian.

> Ian considers it to be a design flaw in apt that the actions the user
> can take while a package is unconfigured are so constrained; however,
> we work with the tools we have, not the tools we'd like to have.

I do not think this argument holds merit. By the same argument, the
tools we have are maintainer scripts with nonzero exit state, and we
should keep those and fix the infrastructure around them.

The TC should make a decision based on what the correct technical
outcome is, not based on what the current state of affairs is. If that
means the TC needs to recommend changes beyond what it was originally
asked to do, then it should do so, rather than shying away from that,
because "the tools just don't work that way". All the tools have source
code, and source code can be fixed.

[...]
> I'm not sure whether we have a concrete example yet of packages at the
> opposite extreme, that are the least important to be able to restart. I'd
> like to propose the game servers that I maintain, like openarena-server,
> as a concrete example here: I hope we can agree that inability to capture
> the flag does not justify getting the package management system into a
> problematic state? :-) (I think this is currently a bug in those packages,
> but I'm not going to fix it until we have consensus here.)

While getting the package management system in a wholly problematic
state is, indeed, a problem, I do think that "failure to restart
openarena-server" might be a critical issue if the only reason you're
paying for a VM or a dedicated server or whatnot is so that you and your
friends (or your customers) can run openarena.

As such, this really depends on the environment, and I think it is wrong
for a maintainer to do anything but signal such failure in the
appropriate way.

> There's a general feeling among the technical committee that a package
> failing to configure is far from a user-friendly way to signal errors:
> Phil's memorable analogy was that it's like telling a car driver that they
> are low on fuel by having the wheels fall off. Historically, we had few
> other ways to manage service failures, and perhaps when all you have is
> a hammer, everything looks like the Failed-Config state; but in a default
> Debian installation we now have a service manager that monitors the state
> of all services at all times (not just when they happen to be upgraded)
> and collects their stderr at all times (not just writing it to the console
> during boot, and dpkg's stderr during upgrades). Even before we considered
> non-sysv init systems, monitoring systems like Nagios were available.

Correct, but it is not correct to state that such monitoring systems are
installed and available on *every* Debian system. If they are, then it
is reasonable to reconfigure the system so that a failed service restart
is not considered a failure; this could be done with the policy-rc.d
extension that I suggested earlier. However, in the absence of such
configuration, the default course of action should be to signal that a
problem has occurred, through the one way available (failing to
configure the package).

[...]
> During the technical committee IRC meeting, we considered whether the
> recommendation to "set -e" in maintainer scripts was consistent with
> considering a maintainer script failing to be a Very Bad Thing. We
> concluded that even if we want to disregard most or all failed service
> restarts, it is still good to "set -e", because if something does go wrong
> (for instance a typo in the maintainer script, a system that is already
> seriously broken, or some other unforeseen circumstance), we want the
> maintainer script to fail safe: stop what it's doing, rather than carry
> on regardless. If a particular failure is something we can reasonably
> predict, reason about and tolerate (as we are arguing failure to restart
> a service is, at least sometimes) then someone should make a conscious
> decision to add "|| true" (or preferably
> "|| some-failure-reporting-mechanism") to that command.

It is unclear to me how a typo in one file (postinst script) trumps a typo in
another file (daemon configuration file causing failure to restart). Care to
explain?

> Finally, here are the debhelper mechanisms that most packages use to
> manage their services, which I think represent the status quo:
> 
> * dh_installinit: defaults to "failure to (re)start is failure to
>   configure", but can be overridden with --error-handler; some packages
>   set the error handler to "true" (e.g. apache2, isc-dhcp) or to a custom
>   shell function (e.g. krb5, samba).

Perhaps the error handler should also be configurable by policy-rc.d, as
I hinted at before.
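
For readers who have not used it: with `dh_installinit
--error-handler=NAME`, the generated snippet calls the shell function
NAME when the init script fails, and the maintainer provides that
function in the postinst before the #DEBHELPER# token. A minimal sketch
of such a handler follows; the function names `restart_failed` and
`run_restart` are made up for the example, and `run_restart` is only a
stand-in for the generated call so that the sketch can be run anywhere.

```shell
# Hypothetical handler, as it could be named via
#   dh_installinit --error-handler=restart_failed
restart_failed() {
    echo "warning: service failed to (re)start; package left configured" >&2
    return 0    # 0 tolerates the failure; "return 1" would fail the postinst
}

# Stand-in for the debhelper-generated call, with the restart command
# injected as arguments.
run_restart() {
    "$@" || restart_failed
}
```

Whether the handler returns 0 or 1 is exactly the decision being
discussed in this bug; the mechanism itself is neutral.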

[...]
> * dh_systemd_start: unconditionally uses "|| true".
>   This is only used for systemd units that *do not* have a corresponding
>   LSB init script. A dh_installinit-style --error-handler would probably
>   be a reasonable feature request.

Same.

-- 
Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008
     Hacklab

Attachment: signature.asc
Description: PGP signature
