
MPI debugging workflows



Hi. Does anybody have experience debugging MPI lockups? I could use some
pointers. The symptom is that the sundials test suite now hangs on some
architectures. For instance, on armhf:

    abel% mpirun -n 4 examples/sunlinsol/spgmr/parallel/test_sunlinsol_spgmr_parallel 100 1 1 50 1e-3 0

    SPGMR linear solver test:
      nprocs = 4
      local/global problem sizes = 100/400
      Gram-Schmidt orthogonalization type = 1
      Preconditioning type = 1
      Maximum Krylov subspace dimension = 50
      Solver Tolerance = 0.001
      timing output flag = 0

        PASSED test -- SUNLinSolGetType
        PASSED test -- SUNLinSolSetATimes
        PASSED test -- SUNLinSolSetPreconditioner
        PASSED test -- SUNLinSolSetScalingVectors
        PASSED test -- SUNLinSolInitialize
        PASSED test -- SUNLinSolSpace, lenrw = 24701, leniw = 440
    SUCCESS: SUNSPGMR module passed all initialization tests

        PASSED test -- SUNLinSolSetup
        PASSED test -- SUNLinSolSolve
        PASSED test -- SUNLinSolLastFlag (0)
        PASSED test -- SUNLinSolNumIters (7)
        PASSED test -- SUNLinSolResNorm
        PASSED test -- SUNLinSolResid
    SUCCESS: SUNSPGMR module, problem 1, passed all tests

        PASSED test -- SUNLinSolSetup
        PASSED test -- SUNLinSolSolve
        PASSED test -- SUNLinSolLastFlag (0)
        PASSED test -- SUNLinSolNumIters (6)
        PASSED test -- SUNLinSolResNorm
        PASSED test -- SUNLinSolResid
    SUCCESS: SUNSPGMR module, problem 2, passed all tests

        PASSED test -- SUNLinSolSetup
        PASSED test -- SUNLinSolSolve
        PASSED test -- SUNLinSolLastFlag (0)
        PASSED test -- SUNLinSolNumIters (7)
        PASSED test -- SUNLinSolResNorm
        PASSED test -- SUNLinSolResid
    SUCCESS: SUNSPGMR module, problem 3, passed all tests

        PASSED test -- SUNLinSolSetup
        PASSED test -- SUNLinSolSolve
        PASSED test -- SUNLinSolLastFlag (0)
        PASSED test -- SUNLinSolNumIters (6)
        PASSED test -- SUNLinSolResNorm
        PASSED test -- SUNLinSolResid
    SUCCESS: SUNSPGMR module, problem 4, passed all tests

        PASSED test -- SUNLinSolSetup
        PASSED test -- SUNLinSolSolve
        PASSED test -- SUNLinSolLastFlag (0)
        PASSED test -- SUNLinSolNumIters (7)
        PASSED test -- SUNLinSolResNorm
        PASSED test -- SUNLinSolResid
    SUCCESS: SUNSPGMR module, problem 5, passed all tests

        PASSED test -- SUNLinSolSetup
    <the test then hangs here until I kill it>

The exact location of the lockup varies. If I run the same test again,
it locks up in a different place, or (rarely) it succeeds. Presumably
there's a race condition somewhere. I haven't spent much time on this
yet. Some questions:

1. Is something like this usually a bug in the MPI implementation or in
the application using it? I.e., is it possible for an application built
on, say, openmpi to contain race conditions of its own?

2. Is the MPI implementation significant? Could mpich behave
differently from openmpi here?

3. Any common debugging techniques?
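
Since the hang is intermittent and its location varies between runs, one
generic first step is to measure how often it hangs and where the output
stops. The following is only a minimal sketch of that idea, assuming
Python 3 on a POSIX system. The helper names `run_once` and `survey`, the
20-run count and the 60-second timeout are all arbitrary choices for
illustration and are not part of sundials or of any MPI implementation.
Output buffering in the test program may also make the last printed line
lag behind the real hang location.

```python
import os
import signal
import subprocess
import sys
from collections import Counter


def run_once(cmd, timeout):
    """Run cmd once; return (hung, last_line).

    last_line is the final non-empty line printed before the command
    exited, or before it was killed for exceeding the timeout.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so children die too
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        hung = False
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill launcher and ranks
        except ProcessLookupError:
            pass  # the process exited just after the timeout fired
        out, _ = proc.communicate()  # collect whatever was printed
        hung = True
    lines = [ln.strip() for ln in (out or "").splitlines() if ln.strip()]
    return hung, (lines[-1] if lines else "")


def survey(cmd, runs=20, timeout=60.0):
    """Run cmd repeatedly; count clean exits and where hung runs stopped."""
    hang_sites = Counter()
    completed = 0
    for _ in range(runs):
        hung, last = run_once(cmd, timeout)
        if hung:
            hang_sites[last] += 1
        else:
            completed += 1
    return completed, hang_sites


if __name__ == "__main__" and len(sys.argv) > 1:
    completed, hang_sites = survey(sys.argv[1:])
    print(f"completed without hanging: {completed}")
    for line, count in hang_sites.most_common():
        print(f"{count:3d} hang(s) after: {line}")
```

Saved under a hypothetical name such as hangsurvey.py, it would be
invoked with the failing command as its arguments, e.g.
`python3 hangsurvey.py mpirun -n 4 examples/sunlinsol/spgmr/parallel/test_sunlinsol_spgmr_parallel 100 1 1 50 1e-3 0`.
A run counts as hung only because it exceeded the timeout, so the
timeout has to be comfortably longer than a normal successful run.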

I'm not asking for any sundials-specific help yet, just for any
experience others may have with these problems.

Thanks.

