Strange EPERM in mpich node on diskless cluster
I have an MPI program which does a popen and fread, something like:
    /* build the gunzip command; snprintf truncates at 999 bytes,
       and returns >= 999 when the command did not fit */
    if (snprintf (filename, 999, "gunzip -c < %s.cpu%.4d.data",
                  basename, rank) >= 999)
        printf ("[%d] command line truncated\n", rank);
    if (!(infile = popen (filename, "r")))   /* NULL on failure; ferror() would crash here */
        printf ("[%d] popen failed: %s\n", rank, strerror (errno));
    /* ... some stuff ... */
    nmemb = fread (globalarray, sizeof (PetscScalar), gridpoints * dof, infile);
    if (nmemb != gridpoints * dof)
        printf ("[%d] short read, ferror = %d\n", rank, ferror (infile));
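ferror() only tells you the stream's error flag is set, not which errno the underlying read(2) hit, so it is worth capturing errno immediately after the short fread and printing strerror() to confirm it really is EPERM. A minimal sketch of that check (the names infile, rank, and the wrapper itself are illustrative, not from the original program):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Like fread(), but on a short read reports the saved errno from the
 * failing read(2) on the pipe before any other libc call can clobber it. */
size_t checked_fread (void *buf, size_t size, size_t nmemb,
                      FILE *stream, int rank)
{
    errno = 0;
    size_t got = fread (buf, size, nmemb, stream);
    if (got != nmemb) {
        int saved = errno;          /* save before printing */
        if (ferror (stream))
            fprintf (stderr, "[%d] fread failed: %s (errno %d)\n",
                     rank, strerror (saved), saved);
        else if (feof (stream))
            fprintf (stderr, "[%d] unexpected EOF after %zu items\n",
                     rank, got);
    }
    return got;
}
```

If the saved errno comes back EPERM on the pipe's read, that points at the kernel or the gunzip child rather than at a corrupted stream state.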
So popen reports no error, but on one cluster the fread fails with
EPERM on anywhere from one to five of the CPUs; on the other cluster
there is no such error. Both are identically-configured Debian
Beowulf clusters using the diskless package and mpich, though the
error-free one is built from dual AthlonXP 1.53 GHz boxes and the
failing one from dual Opteron 240 boxes.
On the other hand, earlier in the run the same program fopen()s files
whose paths and names are identical to the popen's redirected input
except for the extension, and those opens work flawlessly.
machines.LINUX on the starting node (say node2) looks the same in both cases.
Any ideas on what could be going wrong, or how to debug this?