SLURM jobs and hwloc errors
Hello all.
I hope this is the right list. If not, please direct me to the right one.
I installed a SLURM cluster using the default Debian packages, which
track the latest stable release. After the last upgrade (from Debian 10
to 11), some users started seeing messages like this:
-8<--
slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
-8<--
The error appears intermittently: the same job, resubmitted to the same
node, often works and the message does not reappear.
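As far as I understand, the '0x0' in the second error is a CPU mask with
no bits set, i.e. an empty CPU set, so the task has no CPU it could be
bound to. A minimal Python sketch of how such a hex mask maps to CPU
indices (cpus_from_mask is my own illustrative helper, not something
from SLURM or hwloc):

```python
def cpus_from_mask(mask: str) -> list[int]:
    """Decode a taskset-style hex CPU mask into a list of CPU indices."""
    # int() with base 16 accepts the mask with or without a "0x" prefix.
    value = int(mask, 16)
    # Bit N set in the mask means CPU N is part of the set.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(cpus_from_mask("0x0"))  # empty list: no CPU selected
print(cpus_from_mask("0x5"))  # CPUs 0 and 2
```

So the affinity step is apparently being handed an empty set, which
would match the first error about the hwloc lookup failing.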
Sometimes another message (that I suppose is unrelated, but maybe not)
gets logged:
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.
-8<--
Could someone more experienced please help me diagnose (or even fix)
these issues?
Thanks.
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786