SLURM jobs and hwloc errors
I hope this is the right list. If not, please direct me to the right one.
I installed a SLURM cluster using the default Debian packages, which
are kept at the latest stable release. After the last upgrade (from
Debian 10 to 11), some users started seeing messages like this:
slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing,
task/affinity plugin may be required to address bug fixed in HWLOC
slurmstepd-str957-mtx-01: error: task unable to set taskset '0x0'
The error appears seemingly at random: the same job, resubmitted to the
same node, often works and the message does not reappear!
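For reference, the '0x0' in the second error is a CPU affinity mask with
no bits set, so it names no CPUs at all, which may be why the task cannot
be bound to it. A minimal Python sketch of how such a hex mask decodes
(the helper name cpus_in_mask is my own, not part of SLURM or hwloc):

```python
def cpus_in_mask(mask: str) -> list[int]:
    """Return the CPU indices whose bits are set in a hex affinity mask."""
    value = int(mask, 16)  # accepts both "0x5" and "5"
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)  # bit N set means CPU N is allowed
        value >>= 1
        bit += 1
    return cpus


if __name__ == "__main__":
    # '0x0' is the mask from the slurmstepd error; the others are
    # made-up masks for comparison.
    for mask in ("0x0", "0x5", "0xf0"):
        print(mask, cpus_in_mask(mask))
```

An empty list for '0x0' means the step was handed an empty CPU set, so
the question becomes why the mask was computed as empty in the first place.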
Sometimes another message appears (I suppose it is unrelated, but maybe
not):
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
Could someone more experienced please help me diagnose (or even fix)
this issue?
DIFA - Dip. di Fisica e Astronomia
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786