Re: Bug in lam-6.2pl3/blacs1.1/scalapack1.6 combo
On Sun, 19 Sep 1999, Camm Maguire wrote:
> Greetings! I've found a quite reproducible bug in the above software
> combination. The command
>
> mpirun -np 16 -O N xdinv
>
> consistently fails with N=2048,nb=16,nr=nc=4 somewhere in the routine
> pdgetri, specifically in the loop from lines 285 to 306. Running with
> the -lamd option to mpirun clears the problem, which seems to implicate
> lam in the failure. The MPI routines report the following error:
>
> MPI_Recv: process in remote group is dead (rank 0, comm 3)
When running the xdlutime test program under lam 6.2b, I had a problem with
large matrix sizes. It seemed to be caused by a shared memory segment that was
too small for the lam processes to communicate over. I don't have the xdinv
program, but maybe it is the same thing? I set these two environment variables
and they fixed the problem for xdlutime.
export LAM_MPI_SHMPOOLSIZE=32505856
export LAM_MPI_SHMMAXALLOC=2097152
I think they need to be set when lamd starts up on all the nodes, which in
effect means you will need to put them into your .bash_profile file on each
node. You can check by running "ipcs" and seeing whether the size of the shm
segment is 16MB (the default) or 32MB (the new pool size).
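For reference, here is a sketch of the .bash_profile additions and the check described above. The values are the ones that worked for me; the ipcs output format varies between systems, so the exact column layout is an assumption:

```shell
# Add to ~/.bash_profile on every node, so the variables are already
# set when lamd starts up there:
export LAM_MPI_SHMPOOLSIZE=32505856   # 31 * 1024 * 1024 bytes, ~31 MB pool
export LAM_MPI_SHMMAXALLOC=2097152    # 2 * 1024 * 1024 bytes, 2 MB max alloc

# After restarting lamd, list the shared memory segments and look for
# one of roughly 32 MB rather than the 16 MB default:
ipcs -m
```

Note that simply exporting the variables in an already-running shell is not enough if lamd is already up; the daemons read them at startup, so they must be restarted after the change.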