
Bug#954272: slurmd: SLURM not working with OpenMPI



Hi Lars,

On Thu, Mar 19, 2020 at 03:16:15PM +0100, Lars Veldscholte wrote:
> A simple test like `srun hostname` works, even on multiple cores. However, when trying to use MPI, it crashes with the following error message:
> 
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> 
> This happens even in the most simple "Hello World" case, as long as the program is MPI-enabled.
> 
> I am trying to use OpenMPI (4.0.2) from the Debian repositories. `srun --mpi=list` returns:
> 
> srun: MPI types are...
> srun: openmpi
> srun: pmi2
> srun: none
> 
> I have tried all options, but the result is the same in all cases.
> 
> Maybe this is user error, as this is my first time setting up SLURM, but I have not been able to find any possible causes/solutions and I am kind of stuck at this point.

I don't know why srun doesn't launch the openmpi program directly, and
I'll try to investigate this issue, but as a workaround you can use
either salloc or sbatch as in [1]:

salloc -n 4 mpirun mympiprogram ...

or

sbatch -n 4 mympiprogram.sh

where mympiprogram.sh is something like:

#!/bin/sh
mpirun mympiprogram ...

Note that you don't need to pass the number of processes to mpirun; it
takes it from the SLURM allocation.
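
If you prefer to keep the resource request inside the script itself,
you can put it in an #SBATCH directive instead of on the command line.
Just a sketch, assuming a 4-task job and that your program is called
mympiprogram:

#!/bin/sh
#SBATCH -n 4
# mpirun picks up the number of tasks from the SLURM allocation
mpirun mympiprogram ...

and then submit it with plain "sbatch mympiprogram.sh".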

[1] https://www.open-mpi.org/faq/?category=slurm

Best regards,
-- 
Gennaro Oliva

