Bug#984928: Acknowledgement (slurmctld: fails to start on reboot)
- To: David Bremner <bremner@debian.org>, 984928@bugs.debian.org
- Subject: Bug#984928: Acknowledgement (slurmctld: fails to start on reboot)
- From: Gennaro Oliva <oliva.g@na.icar.cnr.it>
- Date: Thu, 27 Jan 2022 23:15:30 +0100
- Message-id: <YfMZgicfRR1OtYcg@ischia>
- Reply-to: Gennaro Oliva <oliva.g@na.icar.cnr.it>, 984928@bugs.debian.org
- In-reply-to: <87czqqvfhv.fsf@tethera.net>
- References: <161537798134.1991510.12203898027789323171.reportbug@convex.cs.unb.ca> <handler.984928.B.161537798514834.ack@bugs.debian.org> <878s682p12.fsf@tethera.net> <161537798134.1991510.12203898027789323171.reportbug@convex.cs.unb.ca> <87czqqvfhv.fsf@tethera.net> <161537798134.1991510.12203898027789323171.reportbug@convex.cs.unb.ca>
Hi David,
sorry for getting back to you so late. Thanks to your valuable
contribution I managed to find a working solution.
On Fri, Aug 06, 2021 at 11:01:48AM -0300, David Bremner wrote:
> I think (one) underlying problem is that the systemd unit file for
> slurmctld is incorrect. The details are in [1], but it seems like
> network.target is not correct (I think it very rarely is a useful
> target). I added the following
>
> # /etc/systemd/system/slurmctld.service.d/override.conf
> [Unit]
> After=network-online.target munge.service
> Wants=network-online.target
Yes, this change is now part of the service file.
> I've switched to systemd-networkd on the hosts in question, so I can't
> easily test how this works with ifupdown, but I notice ifupdown provides
>
> /lib/systemd/system/ifupdown-wait-online.service
>
> which (guessing based on the name) should provide similar functionality
> to those documented in [1] for NetworkManager and systemd-networkd.
>
> [1]: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
Unfortunately, using ifupdown-wait-online didn't help when I use
ifupdown with allow-hotplug interfaces, but I did not test it
thoroughly since I want a solution that works out of the box.
Therefore I decided to patch the slurm code that is failing so that it
retries getaddrinfo before giving up on starting the daemons.
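The retry idea can be sketched roughly as follows. This is only an
illustration in Python, not the actual patch to the slurm C code: the
function name resolve_with_retry and the attempts/delay parameters are
made up for the example, and the real retry count and interval may differ.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, delay=2.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host/port, retrying when name resolution fails.

    Hypothetical helper: early in boot the network (or DNS) may not be
    ready yet, so a failed lookup is retried a few times before the
    error is reported to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port)
        except socket.gaierror as err:
            last_error = err
            # Only wait if another attempt is still to come.
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: give up and propagate the last error.
    raise last_error
```

The resolver and sleep arguments are there only so the helper can be
exercised without a real network; a daemon would simply call it with the
controller host name and port and refuse to start if it still raises.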
Best regards,
--
Gennaro Oliva