
Bug#984928: Acknowledgement (slurmctld: fails to start on reboot)

Hi David,
sorry for getting back to you so late. Thanks to your valuable
contribution, I managed to find a working solution.

On Fri, Aug 06, 2021 at 11:01:48AM -0300, David Bremner wrote:
> I think (one) underlying problem is that the systemd unit file for
> slurmctld is incorrect. The details are in [1], but it seems like
> network.target is not correct (I think it very rarely is a useful
> target).  I added the following
> # /etc/systemd/system/slurmctld.service.d/override.conf
> [Unit]
> After=network-online.target munge.service
> Wants=network-online.target

Yes, this change is now part of the service file.

> I've switched to systemd-networkd on the hosts in question, so I can't
> easily test how this works with ifupdown, but I notice ifupdown provides
> /lib/systemd/system/ifupdown-wait-online.service
> which (guessing based on the name) should provide similar functionality
> to those documented in [1] for NetworkManager and systemd-networkd.
> [1]: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/

Unfortunately, using ifupdown-wait-online didn't help when I use
ifupdown with allow-hotplug interfaces, but I did not test it
thoroughly, since I want a solution that works out of the box.

Therefore I decided to patch the slurm code that is failing so that it
retries getaddrinfo before giving up on starting the daemons.
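
To illustrate the idea, here is a minimal sketch of retrying a name
lookup. It is written in Python for brevity and is not the actual slurm
patch (slurm is written in C); the function name resolve_with_retry, the
retry count, and the delay are all hypothetical choices made for the
example.

```python
import socket
import time


def resolve_with_retry(host, port, retries=5, delay=2.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host/port, retrying on lookup failure before giving up.

    All parameter names and defaults are illustrative assumptions.
    `resolver` and `sleep` are injectable so the logic can be tested
    without touching the network.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")

    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # On success, return the address list immediately.
            return resolver(host, port)
        except socket.gaierror as err:
            # The lookup failed, e.g. because the network is not up yet.
            last_error = err
            if attempt < retries:
                sleep(delay)

    # Every attempt failed: re-raise the last lookup error.
    raise last_error
```

A daemon could call something like resolve_with_retry(hostname, port) at
startup instead of a single lookup, so that a network interface that
comes up a few seconds late does not abort the start.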

Best regards,
Gennaro Oliva
