
Bug#984928: slurmctld: fails to start on reboot



Package: slurmctld
Version: 20.11.4-1
Severity: normal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I have a Slurm cluster set up on a single node. This node is running
slurmctld, munge, and slurmd. When I reboot the node, there seems to be
a race condition, with slurmctld and/or slurmd trying to start before
networking is fully available. By the time I can ssh into the machine,
manually restarting slurmctld and slurmd works. I replaced "localhost"
with "127.0.0.1" in slurm.conf, but that did not change anything.
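
As a rough way to check the suspected race, the following Python sketch
polls getaddrinfo() until an address resolves or a timeout expires. It is
only a diagnostic sketch, not part of Slurm: the function names
can_resolve and wait_until_resolvable, the timeout values, and the idea of
running it early in boot are all assumptions made for illustration. The
flags argument is passed through so that the same lookup can be tried
with, for example, socket.AI_ADDRCONFIG; whether Slurm uses that flag
here is not established by the logs.

```python
import socket
import time


def can_resolve(host, port, flags=0):
    """Return True if getaddrinfo() succeeds for host:port, else False."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM, flags=flags)
        return True
    except socket.gaierror:
        # Same class of failure as "Name or service not known" in the logs.
        return False


def wait_until_resolvable(host, port, timeout=30.0, interval=0.5,
                          probe=can_resolve):
    """Poll probe(host, port) until it succeeds or timeout seconds pass.

    probe is injectable so the loop can be exercised without a network.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # 6817 is the SlurmctldPort from the configuration in this report.
    ok = wait_until_resolvable("127.0.0.1", 6817, timeout=10.0)
    print("resolvable" if ok else "still not resolvable after timeout")
```

If the lookup only starts succeeding some seconds after boot, that would
be consistent with the daemons starting too early.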

slurmctld.log has

[2021-03-10T07:13:08.118] slurmctld version 20.11.4 started on cluster cluster
[2021-03-10T07:13:08.132] No memory enforcing mechanism configured.
[2021-03-10T07:13:08.137] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.137] error: slurm_set_addr: Unable to resolve "127.0.0.1"
[2021-03-10T07:13:08.137] error: slurm_get_port: Address family '0' not supported
[2021-03-10T07:13:08.137] error: _set_slurmd_addr: failure on 127.0.0.1
[2021-03-10T07:13:08.137] Recovered state of 1 nodes
[2021-03-10T07:13:08.138] Recovered JobId=1651 Assoc=0
[2021-03-10T07:13:08.138] Recovered information about 1 jobs
[2021-03-10T07:13:08.138] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.140] Recovered state of 0 reservations
[2021-03-10T07:13:08.140] read_slurm_conf: backup_controller not specified
[2021-03-10T07:13:08.140] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2021-03-10T07:13:08.140] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.141] Running as primary controller
[2021-03-10T07:13:08.141] No parameter for mcs plugin, default values set
[2021-03-10T07:13:08.141] mcs: MCSParameters = (null). ondemand set.
[2021-03-10T07:13:08.142] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.142] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.142] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.144] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.144] fatal: slurm_init_msg_engine_port error Address family not supported by protocol


slurmd.log has



[2021-03-10T07:13:08.195] cgroup namespace 'freezer' is now mounted
[2021-03-10T07:13:08.198] slurmd version 20.11.4 started
[2021-03-10T07:13:08.199] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.199] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.199] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.200] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.200] error: Unable to bind listen port (6818): Address family not supported by protocol
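
To pull only the failures out of logs like the two above, a small Python
filter along these lines may help. It is a sketch that assumes the line
layout seen in this report ("[timestamp] level: message"); the function
name problem_lines is made up for illustration.

```python
import re

# Matches lines such as:
# [2021-03-10T07:13:08.137] error: slurm_set_addr: Unable to resolve "..."
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>error|fatal):\s+(?P<msg>.*)$"
)


def problem_lines(text):
    """Return (timestamp, level, message) for each error/fatal log line."""
    found = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            found.append(
                (match.group("ts"), match.group("level"), match.group("msg"))
            )
    return found


if __name__ == "__main__":
    sample = (
        '[2021-03-10T07:13:08.137] error: slurm_set_addr: '
        'Unable to resolve "127.0.0.1"\n'
        '[2021-03-10T07:13:08.137] Recovered state of 1 nodes\n'
        '[2021-03-10T07:13:08.144] fatal: slurm_init_msg_engine_port error '
        'Address family not supported by protocol\n'
    )
    for ts, level, msg in problem_lines(sample):
        print(ts, level, msg)
```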


- -- System Information:
Debian Release: bullseye/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'testing-security'), (500, 'testing-proposed-updates-debug'), (500, 'testing-debug'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmctld depends on:
ii  libc6                    2.31-9
ii  lsb-base                 11.1.0
pn  munge                    <none>
pn  slurm-client             <none>
pn  slurm-wlm-basic-plugins  <none>
ii  ucf                      3.0043

slurmctld recommends no packages.

slurmctld suggests no packages.

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEkiyHYXwaY0SiY6fqA0U5G1WqFSEFAmBItjwACgkQA0U5G1Wq
FSETBBAAozRM+8NBZYZjdMLJ09KdIXvpOzk7CDgnV1NQTetm+rZxJ1pNpir1fbIz
gzFxIlvjropFD42UJhXI1IkJa5OEoiCrlKCvwJflBdZ2Ap1Qjl/j/vWQRotr+CYk
By5I9Ason/iEEEe3TRVu2Gvs6LsB+92N4JKblpYb8Wn33P7XX4boy9/uKhmtpkDj
sQ4QAP95f+VTsMn/R36e1y3ktRvos0Ao9FAyzorPpDsyjgatN1aBYLfrJI+GSDzP
+Y38vLMcE1wkmP34H8IFmoHuHXkMrNJL8h4lzcMf2YpL2FSya/pJxcoyoRNnCz0h
tMVu2PsHWVFEWat7cQICoyDUZmdNMa396oeoPOOrh7seLwFWBRU8TRVo3+YaXDgp
oKFENCA70Xrptk48No81uKPl2uwdxcpaApecu9IYFVA7W0Tk4VlXO2LZ83VW6z3V
opAzyDQ1lJ9uGpvIQu+gMvDTbVFpdyZd7nrZylsilGqIUecaBEHAfnai73trPziY
KI/7Xwu7ipXOWrLKmWvuyMdZfvvjaGJso4S60C1YDqrI3x+G/HJKqLUMw2VRXl6r
BHOy88D1qIB3v9JxMtlW8kGQRJ4PZo79vG5vmCzKocU5jUhIclAVr2jgcOsRmHuU
vAeCTW5CuFMwQzJxHq+d6GIBg9CQi6yxHn15UBaXrxUUWth/tO0=
=hABj
-----END PGP SIGNATURE-----

slurm.conf has

SlurmctldHost=simplex(127.0.0.1)
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=simplex NodeAddr=127.0.0.1 CPUs=80 RealMemory=385570 CoresPerSocket=20 ThreadsPerCore=2 State=UNKNOWN
PartitionName=login Nodes=simplex Default=YES MaxTime=8:00:00 DefMemPerCPU=1024 MaxMemPerCPU=2048 State=UP
PartitionName=long Nodes=simplex Default=NO MaxTime=120:00:00 DefMemPerCPU=2048 MaxMemPerCPU=4096 MaxCPUsPerNode=40 State=UP
PartitionName=big Nodes=simplex Default=NO MaxTime=24:00:00 MaxCPUsPerNode=80 DefMemPerCpu=4096 State=UP
PartitionName=cron Nodes=simplex Default=NO MaxTime=2:00:00 MaxCPUsPerNode=2 MaxMemPerCPU=1024 State=UP
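
To see at a glance which addresses the daemons will try to resolve at
startup, the configuration above can be scanned with a short Python
sketch like the following. It handles only the two keys that carry an
address in this file (SlurmctldHost and NodeAddr) and is not a general
slurm.conf parser; the function name configured_addresses is an
assumption made for illustration.

```python
import re


def configured_addresses(conf_text):
    """Return (key, address) pairs for SlurmctldHost and NodeAddr entries."""
    found = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # SlurmctldHost=name or SlurmctldHost=name(address)
        match = re.match(r"(?i)SlurmctldHost=([^\s(]+)(?:\(([^)]+)\))?", line)
        if match:
            found.append(("SlurmctldHost", match.group(2) or match.group(1)))
        # NodeAddr=... can appear anywhere on a NodeName line
        for node in re.finditer(r"(?i)\bNodeAddr=(\S+)", line):
            found.append(("NodeAddr", node.group(1)))
    return found


if __name__ == "__main__":
    conf = (
        "SlurmctldHost=simplex(127.0.0.1)\n"
        "SlurmdPort=6818\n"
        "NodeName=simplex NodeAddr=127.0.0.1 CPUs=80 State=UNKNOWN\n"
    )
    for key, addr in configured_addresses(conf):
        print(key, addr)
```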
