
Re: Bug in lam-6.2pl3/blacs1.1/scalapack1.6 combo



On Sun, 19 Sep 1999, Camm Maguire wrote:
> Greetings!  I've found a quite reproducible bug in the above software
> combination.  The command
> 
>  mpirun -np 16 -O N xdinv
> 
> consistently fails with N=2048, nb=16, nr=nc=4 somewhere in the routine
> pdgetri, specifically in the loop from lines 285 to 306.  Running with
> the -lamd option to mpirun clears the problem, which seems to point to lam
> as the source of the failure.  The MPI routines report the following error:
> 
> MPI_Recv: process in remote group is dead (rank 0, comm 3)

When running the xdlutime test program under lam 6.2b, I had a problem with
large matrix sizes.  It seemed to be caused by the shared memory segment that
the lam processes communicate over being too small.  I don't have the xdinv
program, but maybe it is the same thing?  Setting these two environment
variables fixed xdlutime for me.

export LAM_MPI_SHMPOOLSIZE=32505856
export LAM_MPI_SHMMAXALLOC=2097152
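(If I recall the LAM docs correctly, the first sets the total size of the
shared memory pool used for on-node communication and the second caps the
size of a single allocation from that pool, so for large messages you may
need to raise both.)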

I think they need to be set when lamd starts up on all the nodes, which in
effect means putting them into your .bash_profile on each node.  You can
verify by running "ipcs" and checking whether the lam shm segment is 16MB or
32MB.
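
In case it helps, this is roughly what it looks like on my nodes (the
"hostfile" name is just a placeholder for your own boot schema, and I'm
assuming bash and a Linux-style ipcs here):

# in ~/.bash_profile on every node, so the lam daemons inherit them:
export LAM_MPI_SHMPOOLSIZE=32505856   # ~31MB pool instead of the 16MB default
export LAM_MPI_SHMMAXALLOC=2097152    # 2MB cap on a single allocation

# restart the lam daemons so the new values take effect, then check:
lamboot hostfile
ipcs -m                               # the lam shm segment should now show ~32MB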

