MPI debugging workflows
Hi. Does anybody have experience debugging MPI lockups? I could use some
pointers. The symptom is that the sundials test suite now hangs on some
architectures. For instance, on armhf:
abel% mpirun -n 4 examples/sunlinsol/spgmr/parallel/test_sunlinsol_spgmr_parallel 100 1 1 50 1e-3 0
SPGMR linear solver test:
nprocs = 4
local/global problem sizes = 100/400
Gram-Schmidt orthogonalization type = 1
Preconditioning type = 1
Maximum Krylov subspace dimension = 50
Solver Tolerance = 0.001
timing output flag = 0
PASSED test -- SUNLinSolGetType
PASSED test -- SUNLinSolSetATimes
PASSED test -- SUNLinSolSetPreconditioner
PASSED test -- SUNLinSolSetScalingVectors
PASSED test -- SUNLinSolInitialize
PASSED test -- SUNLinSolSpace, lenrw = 24701, leniw = 440
SUCCESS: SUNSPGMR module passed all initialization tests
PASSED test -- SUNLinSolSetup
PASSED test -- SUNLinSolSolve
PASSED test -- SUNLinSolLastFlag (0)
PASSED test -- SUNLinSolNumIters (7)
PASSED test -- SUNLinSolResNorm
PASSED test -- SUNLinSolResid
SUCCESS: SUNSPGMR module, problem 1, passed all tests
PASSED test -- SUNLinSolSetup
PASSED test -- SUNLinSolSolve
PASSED test -- SUNLinSolLastFlag (0)
PASSED test -- SUNLinSolNumIters (6)
PASSED test -- SUNLinSolResNorm
PASSED test -- SUNLinSolResid
SUCCESS: SUNSPGMR module, problem 2, passed all tests
PASSED test -- SUNLinSolSetup
PASSED test -- SUNLinSolSolve
PASSED test -- SUNLinSolLastFlag (0)
PASSED test -- SUNLinSolNumIters (7)
PASSED test -- SUNLinSolResNorm
PASSED test -- SUNLinSolResid
SUCCESS: SUNSPGMR module, problem 3, passed all tests
PASSED test -- SUNLinSolSetup
PASSED test -- SUNLinSolSolve
PASSED test -- SUNLinSolLastFlag (0)
PASSED test -- SUNLinSolNumIters (6)
PASSED test -- SUNLinSolResNorm
PASSED test -- SUNLinSolResid
SUCCESS: SUNSPGMR module, problem 4, passed all tests
PASSED test -- SUNLinSolSetup
PASSED test -- SUNLinSolSolve
PASSED test -- SUNLinSolLastFlag (0)
PASSED test -- SUNLinSolNumIters (7)
PASSED test -- SUNLinSolResNorm
PASSED test -- SUNLinSolResid
SUCCESS: SUNSPGMR module, problem 5, passed all tests
PASSED test -- SUNLinSolSetup
<the test then hangs here until I kill it>
The exact location of the lockup varies. If I run the same test again,
it locks up in a different place or, rarely, runs to completion.
Presumably there's a race condition somewhere. I haven't spent much time
on this yet. Some questions:
1. Is something like this usually a bug in the MPI implementation or in
the application using it? That is, can an application written against,
say, openmpi contain race conditions of its own?
2. Does the choice of MPI implementation matter? Could mpich behave
differently from openmpi here?
3. Any common debugging techniques?
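To make question 3 concrete, one widely used trick is to have each rank
print its PID and then spin until a debugger is attached, so the ranks
can be inspected from the start of the run. The following is only a
minimal sketch: the function wait_for_debugger() and the environment
variable name passed to it are hypothetical, and nothing here comes from
sundials or from any MPI library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not part of sundials or any MPI implementation).
 * If the environment variable named by env_name is set, print this
 * process's PID and spin until a debugger sets `attached` to nonzero.
 * Returns 1 if it waited, 0 if the variable was unset. */
static int wait_for_debugger(const char *env_name)
{
    /* volatile so the compiler re-reads the value a debugger may change */
    volatile int attached = 0;

    if (getenv(env_name) == NULL)
        return 0;

    fprintf(stderr, "PID %ld waiting for debugger\n", (long)getpid());
    fflush(stderr);

    while (!attached)
        sleep(1);

    return 1;
}
```

In an MPI program the call would presumably go right after MPI_Init(),
e.g. wait_for_debugger("WAIT_FOR_DEBUGGER") with that (made-up) variable
exported before mpirun. Each rank can then be attached with gdb -p <pid>;
after moving up to the wait_for_debugger frame, "set var attached = 1"
followed by "continue" lets it proceed. For a hang that has already
happened, attaching gdb -p to each stuck process and taking a backtrace
needs no code change at all.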
I'm not asking for any sundials-specific help yet, just for any
experience others may have with these problems.
Thanks.