
Re: MPICH as default MPI; WAS: MPI debugging workflows




On 07/12/2018 11:26, Drew Parsons wrote:


Hi Alastair, openmpi3 seems to have stabilised now: packages are passing tests and libpsm2 is no longer injecting 15-second delays.

Nice that the mpich 3.3 release is now finalised.  Do we feel confident proceeding with the switch of mpi-defaults from openmpi to mpich?

Are there any known issues with the transition?  One that catches my eye is the build failures in scalapack.  It's been tuned to pass build-time tests with openmpi but fails many tests with mpich (scalapack builds packages for both MPI implementations).  I'm not sure how concerned we should be about those build failures; perhaps upstream should be consulted on it.  Are similar mpich failures expected in other packages?  Is there a simple way of setting up a buildd to do a test run of the transition before making it official?

Drew

Hi Drew,

Looking into it further, I'm now reluctant to move to mpich as the default for buster. One reason is the experience of the openmpi3 transition, which shook out many issues.

I suspect we could see the same with other package builds that, as you point out, have been tuned to openmpi rather than mpich. The other concern is mpich's feature support.

e.g. mpich integration with psm / pmix / slurm is weak (in Debian). While it might not look important to be able to scale to 10k+ nodes on Debian (as none of the top500 machines run Debian), we're seeing an increase in the container use case: building MPI apps within Singularity containers running on our main machine. We don't run Debian as the OS on the base supercomputer at work because we need kernel support from $vendor, but the apps are built in Singularity containers running Debian ... very large scale jobs become increasingly likely, and openmpi / pmix is needed for that. Testing mpich, I've yet to get CH4 working reliably - it's needed for pmix, and the OFI / UCX support is labelled 'experimental'.
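As a concrete example of the kind of check I mean, something like this minimal mpi4py smoke test (just a sketch - it assumes python3-mpi4py has been rebuilt against the mpich under test, and relies on MPICH embedding its configured device/netmod in the library version string, which is worth double-checking):

# Quick MPI smoke test: print the library version string (for MPICH this
# normally reports the configured device, e.g. ch3 vs ch4, and netmod)
# and run one collective to confirm communication actually works.
# Run with e.g.:  mpirun -n 4 python3 mpi_smoketest.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    print(MPI.Get_library_version())

# simple collective as a sanity check
total = comm.allreduce(rank, op=MPI.SUM)
assert total == size * (size - 1) // 2
print("rank %d of %d OK" % (rank, size))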

My driving use case for the move to mpich had been fault tolerance, needed for the co-arrays (https://tracker.debian.org/pkg/open-coarrays) required for Fortran 2018, but I've since re-done open-coarrays to build both openmpi and mpich variants, so that issue has gone away.

So I think more testing of mpich3 builds with CH4 / pmix / OFI support is needed; moving over from openmpi to mpich at this stage is iffy.
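On your question about a test run: short of a proper buildd, the closest I can think of is rebuilding the MPI-heavy source packages in a clean sbuild chroot whose mpi-defaults points at mpich, and comparing the build logs. A rough driver sketch (the package list is only an example - in practice it would come from the build-rdepends of mpi-default-dev - and it assumes sbuild is already configured with an unstable chroot):

#!/usr/bin/python3
# Rough sketch: rebuild a few MPI-using source packages in a clean sbuild
# chroot and record which ones fail, as a preview of the mpich transition.
# Assumes sbuild is set up with an 'unstable' chroot and that the chroot's
# mpi-defaults has been switched to mpich.
import subprocess

# example list only
packages = ["scalapack", "mumps", "petsc", "open-coarrays"]

failed = []
for pkg in packages:
    # sbuild fetches the current source via apt and builds it in the
    # chroot; the build log is written to the current directory.
    result = subprocess.run(["sbuild", "-d", "unstable", pkg])
    if result.returncode != 0:
        failed.append(pkg)

print("failed builds:", ", ".join(failed) if failed else "none")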

regards

Alastair

--
Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.

