Re: MPICH as default MPI; WAS: MPI debugging workflows
On 07/12/2018 11:26, Drew Parsons wrote:
> Hi Alastair, openmpi3 seems to have stabilised now: packages are
> passing tests and libpsm2 is no longer injecting 15-second delays.
> Nice that the mpich 3.3 release is now finalised. Do we feel
> confident proceeding with the switch of mpi-defaults from openmpi to
> mpich?
> Are there any known issues with the transition? One that catches my
> eye is the build failures in scalapack. It's been tuned to pass
> build-time tests with openmpi but fails many tests with mpich
> (scalapack builds packages for both mpi implementations). I'm not sure
> how concerned we should be about those build failures. Perhaps upstream
> should be consulted on it. Are similar mpich failures expected in
> other packages? Is there a simple way of setting up a buildd to do a
> test run of the transition before making it official?
> Drew
Hi Drew,
Looking into it further, I'm now reluctant to make mpich the default for 
buster. One reason is the experience of the openmpi3 transition, which 
shook out many issues.
I suspect we could see the same with other package builds which, as you 
point out, have been tuned to openmpi rather than mpich. The other 
concern is mpich's feature support.
e.g. mpich integration with psm / pmix / slurm is weak (in Debian). 
While being able to scale to 10k+ nodes on Debian might not look 
important (none of the top500 machines run Debian), we're seeing an 
increase in the container use case: building MPI apps within Singularity 
containers running on our main machine. We don't run Debian as the OS on 
the base supercomputer at work because we need kernel support from 
$vendor, but the apps are built in Singularity containers running Debian 
... very large-scale jobs are becoming increasingly likely, and openmpi 
/ pmix is needed for that. Testing mpich I've yet to get CH4 working 
reliably - it's needed for pmix, and the OFI / UCX support is labelled 
'experimental'.
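(For anyone wanting to reproduce that CH4 testing: a minimal 
ring-exchange program along the lines of the sketch below is enough to 
exercise the netmod. The file name and the mpicc / mpirun invocations 
are only illustrative, and assume the mpich-provided wrapper and 
launcher are on the path.)

    /* ring.c - minimal MPI ring-exchange smoke test (illustrative sketch).
     * Build:  mpicc ring.c -o ring
     * Run:    mpirun -n 4 ./ring      (use at least 2 ranks)
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;   /* neighbour we receive from */
        int right = (rank + 1) % size;          /* neighbour we send to */

        if (rank == 0) {
            /* Rank 0 starts the token and waits for it to come back. */
            token = 42;
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("token made it around %d ranks\n", size);
        } else {
            MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Nothing fancy, but it's enough to tell whether a given netmod / launcher 
combination works at all.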
My driving use case for the move to mpich had been fault tolerance - 
needed for the co-arrays (https://tracker.debian.org/pkg/open-coarrays) 
required by Fortran 2018 - but I've since re-done open-coarrays to build 
both openmpi and mpich variants, so that issue has gone away.
So I think more testing of mpich3 builds with CH4 / pmix / OFI support is 
needed, but moving from openmpi to mpich at this stage is iffy.
regards
Alastair
--
Alastair McKinstry, <alastair@sceal.ie>, <mckinstry@debian.org>, https://diaspora.sceal.ie/u/amckinstry
Misentropy: doubting that the Universe is becoming more disordered.