
Bug#904558: What should happen when maintscripts fail to restart a service



Hi,

On Wed, Oct 17, 2018 at 09:47:57PM +0100, Simon McVittie wrote:
> However, it leaves the default as "fail hard", which I'm not convinced
> is the most appropriate thing for systems that lack an experienced
> sysadmin (which are the systems where defaults matter most, because an
> inexperienced user is the least able to make an informed decision about
> where they should deviate from defaults).

I think that's where we disagree, so allow me to focus on that.

I think everyone would agree that when a service fails to (re)start upon
package installation or upgrade, there is a problem, and that this
problem needs to be reported in whatever way is most appropriate (if
not, we have a bigger disagreement than I thought and we need to take a
step back ;-)

The question that remains is "how". Currently, Debian has four ways of
informing a system administrator of such failures:

- Log a message to stdout and/or stderr. This is liable to scroll by
  unnoticed, and therefore is not a reliable way to inform the system
  administrator. For that reason, I don't think it's a good idea.
- Log a message to syslog and/or the systemd journal. This will not
  scroll by, but relies on the system administrator to actively hunt for
  problems in system logs, which they probably won't do unless and until
  they notice that the daemon isn't running anymore (and by that time it
  may be too late).
- Produce a debconf error note. This is mildly better than the above
  two, since debconf error notes are shown at highest priority, and
  therefore will only be hidden if debconf is configured to be
  noninteractive; in that case, debconf will send an email to root. On
  systems without a configured MTA, this will not help; and for daemons
  where failure to restart is a catastrophe that needs to be resolved
  ASAP, such as sshd, this might not be desirable.
- Exit from postinst with nonzero exit status. This is unlikely to be
  missed by system administrators; however, it has several disadvantages
  that were pointed out by other people during this discussion.
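
To make the trade-off concrete, here is a rough sketch of how a postinst
might choose between three of these options. It is untested in a real
package; the function names and the "mode" argument are hypothetical and
not an existing Debian interface, and the debconf error note is left out
because it needs a running debconf frontend.

```sh
#!/bin/sh
# Hypothetical helpers: the names and the MODE argument are illustrative
# only and do not correspond to any existing Debian interface.

# report_restart_failure SERVICE MODE
#   warn: print to stderr and carry on (first option in the list)
#   log:  hand the message to logger(1), if present (second option)
#   fail: return nonzero so the caller can abort postinst (fourth option)
report_restart_failure() {
    service=$1
    mode=$2
    msg="failed to restart $service"
    case "$mode" in
        warn)
            echo "warning: $msg" >&2
            return 0
            ;;
        log)
            if command -v logger >/dev/null 2>&1; then
                logger -t postinst -p daemon.err "$msg" || true
            fi
            return 0
            ;;
        fail)
            echo "error: $msg" >&2
            return 1
            ;;
        *)
            echo "error: unknown report mode '$mode'" >&2
            return 2
            ;;
    esac
}

# restart_service SERVICE MODE RESTART_COMMAND...
# Runs the restart command and reports only if it fails.
restart_service() {
    service=$1
    mode=$2
    shift 2
    if "$@"; then
        return 0
    fi
    report_restart_failure "$service" "$mode"
}
```

A postinst using this might call something like
"restart_service foo fail invoke-rc.d foo restart || exit 1", where "foo"
is a placeholder service name. With "warn" or "log" the same call would
let the upgrade continue.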

I think it is perfectly fine to have the TC say that "failures to
restart a service must be reported, either by exiting nonzero, or by
another appropriate action", without going into detail about what those
other actions could be.

> policy-rc.d also has some practical integration issues. It normally relies
> on putting an unpackaged file in /usr/sbin (unless you have installed
> policyrcd-script-zg2), and it's common for tools like debootstrap and
> debian-installer to create and delete policy-rc.d to suppress service
> startup while carrying out bootstrap operations. One Debian derivative
> that I'm involved in (SteamOS) is *meant* to have a policy-rc.d, but we
> recently discovered that it has always been deleted at the end of the
> debian-installer run, and so doesn't exist in practice.

I think that problem is not something that should be resolved by this
discussion.

I'll readily admit that I did not actually test any of the suggestions I
made wrt policy-rc.d. There are other issues with it too; I'm thinking
of filing a wishlist bug to have it replaced by something better.

On top of that, policy-rc.d has always irked me as a bit of an awkward
interface; it is the only type of Debian-specific configuration that
does not go into /etc, but for which you need to write a script in
/usr/sbin. This is confusing, as shown by debian-installer removing it
unconditionally. In an ideal world, the policies currently implementable
through policy-rc.d should be configuration snippets in a run-parts
style directory. The "just drop a script somewhere" idea is a
poorly-defined interface which is inflexible and inappropriate for the
purpose of a distribution, but "policy-rc.d should be replaced by
something better" is not an appropriate response to the question "what
should happen when a service fails to restart in postinst".

Also related to this problem is what happens with postinst failing for
other reasons than "the daemon doesn't restart". While that is probably
the most likely reason for postinst failures today, it is by no means
the only one; so if you say "postinst failing because of daemon restart
failing" is something that should not ever happen, I think you should
then also make guidelines as to when, exactly, a postinst should be
allowed to fail (and muck up the whole system).

-- 
To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard

