
Bug#904558: What should happen when maintscripts fail to restart a service



Hi Tollef,

On Fri, Sep 21, 2018 at 09:53:13PM +0200, Tollef Fog Heen wrote:
> ]] Wouter Verhelst 
> 
> > On Tue, Sep 18, 2018 at 10:04:26PM +0200, Tollef Fog Heen wrote:
> 
> [...]
> 
> > > The API provided by a package being in the configured state is not
> > > whether the relevant daemon is running or not; that is runtime and can
> > > and will change many times while the package is in the configured state,
> > > so dpkg dependencies are not useful for expressing «this service must be
> > > running».
> > 
> > No. But it *is* a useful way to express "this service must be able to
> > run".
> 
> That's not what «configured» means, though.

Disagree.

> «apt install foo ; rm /etc/foo.conf» and the package will be in a «running,
> but can't restart» state, but also configured in dpkg terms.

Well, sure, but that's true for any kind of configuration, and is not
specific to daemons: if you blow away a package's configuration, all
bets are off, so I fail to see your point.

The point is not "what happens after the install run has happened"; it
is about finding problems early rather than late.

> > Additionally, if something fails to restart, then that is a serious
> > problem that I, as a system administrator, would like to know about.
> > Failure to configure a package signals that there is a serious problem
> > that I need to fix, so that informs me.
> 
> I think monitoring should be implemented using monitoring tools, so if
> you actually care if a service is up, you should monitor it rather than
> relying on postinsts failing or succeeding.

First, the fact that there are tools to deal with this externally from
dpkg shouldn't mean that dpkg itself can't deal with it.

Second, if I manually upgrade something and postinst fails, I know
immediately that something is wrong. In contrast, if I upgrade something
and postinst does not fail, I have to rely on monitoring to notify me,
and it may take a while before I notice something is wrong, because
monitoring tools often only tell me after a few minutes.

Third, the person who performs the upgrade is not necessarily the same
person as the one who notices something is wrong on the monitoring
system; the lack of immediate feedback that the upgrade broke things
will make debugging and fixing the problem more involved than it should
be.

I think "there are tools to do X" is a terrible argument for "postinst
shouldn't do X".
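
A minimal sketch of the behaviour argued for above, assuming a placeholder
service called "foo". The RESTART_CMD override is hypothetical and only
exists so the logic can be tried without a real service; an actual postinst
would call invoke-rc.d (or whatever helper the package uses) directly.

```shell
# Sketch of a postinst "configure" step; "foo" is a placeholder service name.
# RESTART_CMD is a hypothetical override for experimenting with the logic.
RESTART_CMD="${RESTART_CMD:-invoke-rc.d foo restart}"

configure_step() {
    # Run the restart and turn its exit status into the postinst result.
    if $RESTART_CMD; then
        echo "foo restarted"
        return 0
    fi
    # A non-zero return here is what would leave the package unconfigured,
    # so the person running the upgrade sees the failure right away.
    echo "foo failed to restart; leaving package unconfigured" >&2
    return 1
}

# In a real postinst (not executed in this sketch) the tail would be roughly:
#   case "$1" in configure) configure_step || exit 1 ;; esac
```

With RESTART_CMD=false the function returns 1, which is the "postinst fails,
dpkg tells me immediately" case from the second point above.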

> Alternatively, you could just add «systemctl is-system-running» to a
> post-dpkg-invoke hook, it'll tell you if there are daemons that have
> failed.

That I can work around someone (you?) breaking reasonable expectations
isn't an excuse for breaking those reasonable expectations in the first
place.
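
For reference, the workaround suggested in the quoted text could look
roughly like the helper below. This is only a sketch: the function names
are made up, and it assumes the helper gets wired into APT's
DPkg::Post-Invoke setting from some file under /etc/apt/apt.conf.d/.

```shell
# Hypothetical helper meant to be called from an APT DPkg::Post-Invoke hook.

classify_state() {
    # Only the overall state "running" counts as healthy; anything else
    # (for example "degraded") is treated as a problem.
    case "$1" in
        running) return 0 ;;
        *) return 1 ;;
    esac
}

check_system() {
    # is-system-running prints the state and exits non-zero unless the
    # state is "running", hence the "|| true" to keep the captured output.
    state="$(systemctl is-system-running 2>/dev/null || true)"
    if ! classify_state "$state"; then
        echo "warning: system state is '${state:-unknown}'" >&2
        # List the units that are currently in the failed state.
        systemctl --failed --no-legend >&2 || true
        return 1
    fi
    return 0
}
```

Note that this only reports after dpkg has finished; it does not change the
state of the package whose postinst restarted the daemon, which is the
expectation at issue here.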

> [...]
> 
> > There are really only two[1] reasons why a daemon could fail to restart:
> > 
> > - The maintainer made a mistake in the default configuration, and the
> >   user didn't make any changes so the old conffiles are being replaced
> >   by the new ones, or the package is being newly installed; now the
> >   daemon encounters a syntax error. This is a bug, plain and simple, and
> >   catching bugs earlier rather than later is a good idea, which will
> >   happen if the daemon restart failure causes a postinst failure.
> > - The maintainer made no mistake, but the upgrading user made some local
> >   changes, so the conffile system ensures that the syntactic differences
> >   in the configuration are not incorporated and the daemon fails to
> >   restart. As a system administrator, I would want to know when
> >   something like that happens sooner rather than later, so that I can
> >   fix it (also sooner rather than later). Failing to finish postinst
> >   correctly ensures that that does happen.
> 
> In addition to this: Any number of runtime problems.  The disk might be
> full.  The service might try to look up a user whose name is in LDAP and
> the network is down and thus the user lookup fails.  Some hardware the
> service needs is not plugged in or doesn't work correctly.  Data files
> are corrupted.  Out of memory.  I'm sure you can come up with more. :-)

Well, yeah, and I like it if dpkg gives me an error when I try to
install something and, say, the disk is full.

> This then also ties into what the semantics of «daemon is started»
> should be: is it that the service has started, or that it is working?
> What should happen if you, on a host with no network connectivity (or
> just heavily firewalled), do «apt install ntp»?  Should it wait until
> the clock is synced (effectively forever in this case)?  Should the
> postinst fail until you've fixed the firewall?

If the daemon is running and it would work as soon as it can reach the
internet? No, it should continue.

If the daemon is failing to start because of, say, mandatory access
control not being configured yet? Yes, in that case it should fail,
because that is a dependency bug, and we want to know about it.
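
The distinction drawn in the two cases above can be sketched as follows,
using the ntp example from the quoted text. The unit name "ntp" and the
IS_ACTIVE_CMD override are assumptions for illustration; the point is only
that the check asks "did the daemon start?", not "is the clock synced?".

```shell
# IS_ACTIVE_CMD is a hypothetical override for trying this out; the default
# asks systemd whether the (assumed) unit "ntp" is active.
IS_ACTIVE_CMD="${IS_ACTIVE_CMD:-systemctl is-active --quiet ntp}"

daemon_started() {
    # Deliberately says nothing about network reachability or sync state.
    $IS_ACTIVE_CMD
}

postinst_result() {
    if daemon_started; then
        # Running, possibly unable to reach its peers yet: carry on.
        return 0
    fi
    # Could not start at all (e.g. access control not set up): report the
    # failure to dpkg so that it gets noticed.
    return 1
}
```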

> > [1] There is also the possibility of "the package ships with incomplete
> >     configuration on purpose, because there are no sane defaults to use
> >     and installing the package requires manual steps from the maintainer
> >     before it can be made to work", but (a) our best practices recommend
> >     against doing that if at all possible, and (b) in that case starting
> >     the daemon shouldn't even be attempted from postinst, and so failure
> >     to start can't be a consideration in the exit state of postinst.
> 
> You might still want to restart it on upgrade to ensure you don't run
> outdated binaries.

Sure. This bug isn't about "you might still want to do X" though, it's
about "what should we do if X fails". Let's stick to the core issue?

-- 
Could you people please use IRC like normal people?!?

  -- Amaya Rodrigo Sastre, trying to quiet down the buzz in the DebConf 2008
     Hacklab

