Bug in lam-6.2pl3/blacs1.1/scalapack1.6 combo

Greetings!  I'm forwarding this previously submitted bug report to
the beowulf lists and the lam users list to look for interested users
who could either confirm, deny, or help resolve this bug.

From: Camm Maguire <camm@enhanced.com>
To: scalapack@cs.utk.edu,llamas@mpi.nd.edu
cc: camm@enhanced.com
Subject: Bug in lam-6.2pl3/blacs1.1/scalapack1.6 combo
Mime-Version: 1.0 (generated by tm-edit 7.106)
Content-Type: text/plain; charset=US-ASCII
Message-Id: <E11SAqu-0007kf-00@intech19.enhanced.com>
Date: Fri, 17 Sep 1999 23:07:52 -0400

Greetings!  I've found a quite reproducible bug in the above software
combination.  The command

 mpirun -np 16 -O N xdinv

consistently fails with N=2048, nb=16, nr=nc=4 somewhere in the routine
pdgetri, specifically in the loop from lines 285 to 306.  Running with
the -lamd option to mpirun clears the problem, which seems to implicate
LAM in the failure.  The MPI routines report the following error:

MPI_Recv: process in remote group is dead (rank 0, comm 3)

where the rank and comm numbers vary with no discernible pattern.  I'm
running Linux 2.2.12 on a 16-node PII-350 Beowulf over 100Mbit
switched Fast Ethernet.  There are no errors reported in the kernel
logs.  LAM was configured with

	./configure --prefix=`pwd`/debian/tmp/usr/lib/lam \
		    --with-final-home=/usr/lib/lam \
		     --with-rpi=usysv \
		     --with-shared \

and built with 

intech19:/fix/c/home/camm/scalapack-1.6# egcc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/egcs-2.91.60/specs
gcc version egcs-2.91.60 Debian 2.1 (egcs-1.1.1 release)

I've noticed that the problem block size (or at least the most
frequent one) is 16 when using double precision, which corresponds to
a 2k message, the same length as in the lam/Linux performance problem
reported on the web site.  Of course, here we don't just see poor
performance, but failure.  I'll be trying lam 6.2 pl4 soon.  Please
advise if I can supply any further information regarding this bug.
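
For reference, a minimal sketch of the arithmetic behind the "2k
message" figure.  It assumes each message carries one square nb x nb
block of 8-byte double-precision values with no header overhead; that
is an assumption about the message layout, not something confirmed
from the BLACS or ScaLAPACK sources.  The helper name
block_message_bytes is hypothetical.

```python
# Assumption: one message = one square nb x nb block of doubles,
# 8 bytes per element, no protocol or header overhead counted.
DOUBLE_BYTES = 8


def block_message_bytes(nb, elem_bytes=DOUBLE_BYTES):
    """Payload size in bytes of one nb x nb block (hypothetical helper)."""
    return nb * nb * elem_bytes


if __name__ == "__main__":
    # Print the payload size for a few block sizes around nb=16.
    for nb in (8, 16, 32):
        print("nb=%d -> %d bytes" % (nb, block_message_bytes(nb)))
```

Under that assumption, nb=16 gives 16 * 16 * 8 = 2048 bytes.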

PS.  Since writing this, I've tried lam-6.2b-pl4, and found the same
situation.  The problem appears for block sizes in the 16-28 range;
outside that range all is stable.  Blacs is patched with the latest
mpi patch.

Take care,

Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
