
Bug#984928: marked as done (slurmctld: fails to start on reboot)



Your message dated Fri, 28 Jan 2022 23:34:02 +0000
with message-id <E1nDakk-000FNa-BH@fasolo.debian.org>
and subject line Bug#984928: fixed in slurm-wlm 21.08.5-2
has caused the Debian Bug report #984928,
regarding slurmctld: fails to start on reboot
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
984928: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984928
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: slurmctld
Version: 20.11.4-1
Severity: normal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I have a Slurm cluster set up on a single node, which runs slurmctld,
munge, and slurmd. When I reboot the node, there appears to be a race
condition: slurmctld and/or slurmd try to start before networking is
fully available. By the time I can ssh into the machine, manually
restarting slurmctld and slurmd works. I replaced "localhost" with
"127.0.0.1" in the configuration, but that does not seem to change anything.
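
A common mitigation for this kind of startup race on systemd systems (as
this one is) is to order the daemons after network-online.target with a
drop-in override. This is a sketch of a workaround, not the packaged fix;
the drop-in path and contents are illustrative:

# /etc/systemd/system/slurmctld.service.d/wait-online.conf
# Hypothetical drop-in: delay slurmctld until the network is fully up.
# An identical drop-in can be created for slurmd.service.
[Unit]
Wants=network-online.target
After=network-online.target

After creating the drop-in, run "systemctl daemon-reload". Note that
network-online.target only delays startup if a wait-online service
(e.g. systemd-networkd-wait-online or NetworkManager-wait-online) is
enabled on the host.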

slurmctld.log has

[2021-03-10T07:13:08.118] slurmctld version 20.11.4 started on cluster cluster
[2021-03-10T07:13:08.132] No memory enforcing mechanism configured.
[2021-03-10T07:13:08.137] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.137] error: slurm_set_addr: Unable to resolve "127.0.0.1"
[2021-03-10T07:13:08.137] error: slurm_get_port: Address family '0' not supported
[2021-03-10T07:13:08.137] error: _set_slurmd_addr: failure on 127.0.0.1
[2021-03-10T07:13:08.137] Recovered state of 1 nodes
[2021-03-10T07:13:08.138] Recovered JobId=1651 Assoc=0
[2021-03-10T07:13:08.138] Recovered information about 1 jobs
[2021-03-10T07:13:08.138] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.140] Recovered state of 0 reservations
[2021-03-10T07:13:08.140] read_slurm_conf: backup_controller not specified
[2021-03-10T07:13:08.140] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2021-03-10T07:13:08.140] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2021-03-10T07:13:08.141] Running as primary controller
[2021-03-10T07:13:08.141] No parameter for mcs plugin, default values set
[2021-03-10T07:13:08.141] mcs: MCSParameters = (null). ondemand set.
[2021-03-10T07:13:08.142] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.142] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.142] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.144] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.144] fatal: slurm_init_msg_engine_port error Address family not supported by protocol


slurmd.log has

[2021-03-10T07:13:08.195] cgroup namespace 'freezer' is now mounted
[2021-03-10T07:13:08.198] slurmd version 20.11.4 started
[2021-03-10T07:13:08.199] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2021-03-10T07:13:08.199] error: slurm_set_addr: Unable to resolve "(null)"
[2021-03-10T07:13:08.199] error: slurm_set_port: attempting to set port without address family
[2021-03-10T07:13:08.200] error: Error creating slurm stream socket: Address family not supported by protocol
[2021-03-10T07:13:08.200] error: Unable to bind listen port (6818): Address family not supported by protocol


- -- System Information:
Debian Release: bullseye/sid
  APT prefers unstable-debug
  APT policy: (500, 'unstable-debug'), (500, 'testing-security'), (500, 'testing-proposed-updates-debug'), (500, 'testing-debug'), (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-3-amd64 (SMP w/8 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8), LANGUAGE=en_CA:en
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages slurmctld depends on:
ii  libc6                    2.31-9
ii  lsb-base                 11.1.0
pn  munge                    <none>
pn  slurm-client             <none>
pn  slurm-wlm-basic-plugins  <none>
ii  ucf                      3.0043

slurmctld recommends no packages.

slurmctld suggests no packages.

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEkiyHYXwaY0SiY6fqA0U5G1WqFSEFAmBItjwACgkQA0U5G1Wq
FSETBBAAozRM+8NBZYZjdMLJ09KdIXvpOzk7CDgnV1NQTetm+rZxJ1pNpir1fbIz
gzFxIlvjropFD42UJhXI1IkJa5OEoiCrlKCvwJflBdZ2Ap1Qjl/j/vWQRotr+CYk
By5I9Ason/iEEEe3TRVu2Gvs6LsB+92N4JKblpYb8Wn33P7XX4boy9/uKhmtpkDj
sQ4QAP95f+VTsMn/R36e1y3ktRvos0Ao9FAyzorPpDsyjgatN1aBYLfrJI+GSDzP
+Y38vLMcE1wkmP34H8IFmoHuHXkMrNJL8h4lzcMf2YpL2FSya/pJxcoyoRNnCz0h
tMVu2PsHWVFEWat7cQICoyDUZmdNMa396oeoPOOrh7seLwFWBRU8TRVo3+YaXDgp
oKFENCA70Xrptk48No81uKPl2uwdxcpaApecu9IYFVA7W0Tk4VlXO2LZ83VW6z3V
opAzyDQ1lJ9uGpvIQu+gMvDTbVFpdyZd7nrZylsilGqIUecaBEHAfnai73trPziY
KI/7Xwu7ipXOWrLKmWvuyMdZfvvjaGJso4S60C1YDqrI3x+G/HJKqLUMw2VRXl6r
BHOy88D1qIB3v9JxMtlW8kGQRJ4PZo79vG5vmCzKocU5jUhIclAVr2jgcOsRmHuU
vAeCTW5CuFMwQzJxHq+d6GIBg9CQi6yxHn15UBaXrxUUWth/tO0=
=hABj
-----END PGP SIGNATURE-----
The reporter's slurm.conf (attached):

SlurmctldHost=simplex(127.0.0.1)
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=simplex NodeAddr=127.0.0.1 CPUs=80 RealMemory=385570 CoresPerSocket=20 ThreadsPerCore=2 State=UNKNOWN
PartitionName=login Nodes=simplex Default=YES MaxTime=8:00:00 DefMemPerCPU=1024 MaxMemPerCPU=2048 State=UP
PartitionName=long Nodes=simplex Default=NO MaxTime=120:00:00 DefMemPerCPU=2048 MaxMemPerCPU=4096 MaxCPUsPerNode=40 State=UP
PartitionName=big Nodes=simplex Default=NO MaxTime=24:00:00 MaxCPUsPerNode=80 DefMemPerCPU=4096 State=UP
PartitionName=cron Nodes=simplex Default=NO MaxTime=2:00:00 MaxCPUsPerNode=2 MaxMemPerCPU=1024 State=UP

--- End Message ---
--- Begin Message ---
Source: slurm-wlm
Source-Version: 21.08.5-2
Done: Gennaro Oliva <oliva.g@na.icar.cnr.it>

We believe that the bug you reported is fixed in the latest version of
slurm-wlm, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 984928@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Gennaro Oliva <oliva.g@na.icar.cnr.it> (supplier of updated slurm-wlm package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Format: 1.8
Date: Fri, 28 Jan 2022 23:36:39 +0100
Source: slurm-wlm
Architecture: source
Version: 21.08.5-2
Distribution: unstable
Urgency: medium
Maintainer: Debian HPC Team <debian-hpc@lists.debian.org>
Changed-By: Gennaro Oliva <oliva.g@na.icar.cnr.it>
Closes: 984928 1004409
Changes:
 slurm-wlm (21.08.5-2) unstable; urgency=medium
 .
   * Add slurm_version.h to libslurm-dev (Closes: #1004409)
   * Add retry-getaddrinfo patch (Closes: #984928)
   * Update service files
Checksums-Sha1:
 5a8da3888c50f618def6c9990b68829464d86a94 3738 slurm-wlm_21.08.5-2.dsc
 ced5ef2dbe2bdfdc01317908c54b5514b1e7af75 132636 slurm-wlm_21.08.5-2.debian.tar.xz
 a17f9cb5d1a394e47d5b40278e8b1eef1ad934e3 21635 slurm-wlm_21.08.5-2_amd64.buildinfo
Checksums-Sha256:
 50a2bd018ae7e9210231e721f37d5d275019b0a86fa53c587e3f6c2bf2d316e8 3738 slurm-wlm_21.08.5-2.dsc
 9fcbf38dbf0cc797e772249a8cf924bebe24900fc742b70e17b5082352c1c541 132636 slurm-wlm_21.08.5-2.debian.tar.xz
 c37673d8331b5a4d23feba1417f85e8f95de633b4266c7f8667bce6d76c5e5a4 21635 slurm-wlm_21.08.5-2_amd64.buildinfo
Files:
 94b0b11242d636cd214e681a6a194284 3738 admin optional slurm-wlm_21.08.5-2.dsc
 fd2f63cdcef05d7e708232bfc3f2e405 132636 admin optional slurm-wlm_21.08.5-2.debian.tar.xz
 f71bfd0e33830287f7f237a66a2f46dd 21635 admin optional slurm-wlm_21.08.5-2_amd64.buildinfo

-----BEGIN PGP SIGNATURE-----

iQJLBAEBCgA1FiEE6zNF9WRBuLgad5h2ffpBrZYZhdcFAmH0ctUXHG9saXZhLmdA
bmEuaWNhci5jbnIuaXQACgkQffpBrZYZhdf3Tw/6A+wW+GOjvdpSKQ+cmmftE2wb
Q6kDWVjVkhEmN27U+naZD/c5ggkQFbjc92+P5e3F1lXhsIq4Q6Eu2VRa/2PtjP8i
1mjiNIBnvhdWZz+jC3l4bcCMSwXWjtfSqS6jB3YVKdjMebcr9WnH8D/pKq1UO4FD
vYyzNJJa5y2liUIy/rGHp3V1q2MTDWPh3O+DbLnUcSbaOD6igVqNsj1FzSuWry3w
/ZwjDeciT+gSNdWkoK0ef2GbpITs+yeQmkP6Hpe1WwX/OlTaSaAYFe8aKXGDnUZD
cru2seztmcxaXlP3LYaAm8aDyaTmWQT34EKO9IXBzGf8Hqx7MG+ODylpxsgSZhwH
N2mkguFNuU0BU6e/t3/9OLJmAQ8wqY2I/jBOOkjMvx9Ei56qJdLwBQaLs//BxfYZ
OOpxM0qr57odmrJrtUrRrOt8IcrFET0L5zyC+1PFMTshyOXRuTbiuAg86I66+C2j
UOhxDNiys5KHFh4Pu6cMuAqsyJVi7dj4XQ01HIapbIDALL6ljakTRXY6VHtadvfl
2eh2UM3PuB9peYNtDwcB3rzSBn/N1wfNG6wAHhbhY3noEBhQvwOScAunABIGkMIq
Ml9xn8c8T83PAZCvYFN//jfPg0JCU0HeJkulM8/dyZ3wvCt7DCxttnCGs3Dsg3ZC
GqPdHxfv0VwOmOJBOmQ=
=Lwgu
-----END PGP SIGNATURE-----

--- End Message ---
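
The "retry-getaddrinfo" patch named in the changelog addresses the race
by retrying name resolution instead of failing immediately, as the
fatal "getaddrinfo() failed" errors in the logs above show. The sketch
below illustrates the general technique only; the function name, retry
policy, and error handling are assumptions for illustration, not the
actual Debian patch:

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical retry wrapper around getaddrinfo(): retry transient
 * resolver failures (the kind seen before the network is up) a few
 * times before giving up, instead of failing fatally on first error. */
static int getaddrinfo_retry(const char *node, const char *service,
                             const struct addrinfo *hints,
                             struct addrinfo **res,
                             int max_attempts, unsigned delay_sec)
{
    int rc = EAI_FAIL;
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        rc = getaddrinfo(node, service, hints, res);
        if (rc == 0)
            return 0;                       /* resolved */
        if (rc != EAI_AGAIN && rc != EAI_NONAME)
            break;                          /* non-transient failure */
        if (attempt < max_attempts)
            sleep(delay_sec);               /* wait for resolver/network */
    }
    return rc;
}

int main(void)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    /* Resolve the controller address from the report's slurm.conf. */
    if (getaddrinfo_retry("127.0.0.1", "6817", &hints, &res, 3, 1) == 0) {
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("resolved: %s:%d\n", buf, ntohs(sin->sin_port));
        freeaddrinfo(res);
        return 0;
    }
    return 1;
}
```

With a numeric address like 127.0.0.1 the call succeeds on the first
attempt; the retry loop matters when hostnames must be resolved before
DNS or /etc/nsswitch.conf-backed resolution is ready at boot.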
