
Occasional EPERM in mpich node on a diskless cluster



Greetings,

I have an MPI program which does a popen and fread, something like:

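      /* snprintf returns the length it wanted to write, so a return
         value >= the buffer size means the command was truncated */
      if (snprintf (filename, 999, "gunzip -c < %s.cpu%.4d.data",
                    basename, rank) >= 999)
        return 1;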
      if (!(infile = popen (filename, "r")))
        return 1;
      if (ferror (infile))
      {
          printf ("[%d] Pipe open has error %d\n", rank, ferror(infile));
          fflush (stdout);
      }
      ... some stuff ...
        nmemb=fread (globalarray, sizeof (PetscScalar), gridpoints * dof, infile);
        if (nmemb != gridpoints*dof)
        {
            printf ("[%d] ferror = %d\n", rank, ferror (infile));
            fflush (stdout);
        }

So there seems to be no error in the popen, but on between one and five
CPUs the fread results in an EPERM error.  I see this on two clusters; on
the second one the error is less frequent but still there.  Both are
identically-configured Debian beowulfs using the diskless package and
mpich.  The one with fewer errors is made of dual AthlonXP 1.53 GHz
boxes; the one with more errors is made of dual Opteron 240 boxes running
Debian stock -k7-smp kernels and a 32-bit userland.
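
One thing worth ruling out: ferror() only returns nonzero when the
stream's error indicator is set; it does not return an errno code.  A
printed value of 1 therefore does not by itself mean EPERM (which is
errno 1 on Linux).  Below is a minimal sketch of how the read could be
instrumented to capture errno and the exit status of the gunzip child.
The helper name read_from_command and its parameters are hypothetical,
not part of the original program, and this is only one possible way to
narrow the problem down:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper (not from the original program): run `cmd` through
 * popen, try to read `want` items of `size` bytes into `buf`, and report
 *   *read_errno : errno captured right after a failed popen or fread,
 *                 or 0 if the stream's error indicator was not set
 *   *cmd_status : the wait status returned by pclose, or -1 if popen failed
 * Returns the number of complete items actually read. */
size_t read_from_command(const char *cmd, void *buf, size_t size,
                         size_t want, int *read_errno, int *cmd_status)
{
    FILE *pipe;
    size_t got;

    assert(cmd && buf && read_errno && cmd_status);
    *read_errno = 0;
    *cmd_status = -1;

    pipe = popen(cmd, "r");
    if (!pipe) {
        *read_errno = errno;
        return 0;
    }

    errno = 0;
    got = fread(buf, size, want, pipe);
    /* A short read with the error indicator clear is just early EOF,
     * e.g. because the child exited before producing all the data. */
    if (got != want && ferror(pipe))
        *read_errno = errno;

    /* pclose waits for the child; decode with WIFEXITED/WEXITSTATUS
     * or WIFSIGNALED/WTERMSIG from <sys/wait.h>. */
    *cmd_status = pclose(pipe);
    return got;
}
```

Calling this with the gunzip command string in place of the bare
popen/fread pair, and printing strerror(read_errno) together with
WEXITSTATUS(cmd_status) for each rank, should show whether a short read
is a genuine read error or the child process exiting early.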

On the other hand, earlier in the same program an fopen() is done on a
file whose path and name are identical to the popen redirected input
except for the extension, and that works flawlessly.
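
To separate an NFS or permissions problem from a problem with the pipe
itself, the same direct-read check could be applied to the compressed
input file on every rank just before the popen.  The following is a
rough sketch under that assumption; the helper name probe_direct_read is
hypothetical and the path passed to it would be the .data file that
gunzip reads:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper (not from the original program): open `path`
 * directly and try to read one byte from it.
 * Returns 0 if a byte was read, the errno value if fopen or the read
 * failed with errno set, and -1 if the file is empty or errno was not set. */
int probe_direct_read(const char *path)
{
    FILE *f;
    int err = 0;

    assert(path);
    errno = 0;
    f = fopen(path, "rb");
    if (!f)
        return errno ? errno : -1;

    if (fgetc(f) == EOF) {
        /* Distinguish a read error from a plain empty file. */
        if (ferror(f) && errno)
            err = errno;
        else
            err = -1;
    }
    fclose(f);
    return err;
}
```

If this returns 0 on a rank where the subsequent popen/fread still comes
up short, the file itself is readable from that node and the problem is
more likely in the shell, gunzip, or the pipe.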

machines.LINUX on the starting node (say node2) in both cases looks
something like:
node2
node2
node3
node3
etc.

Authentication is via NIS, whose master server (and NFS server for the
files in question) is outside of the "subnet" of these clusters,
something like:

node1 node2 node3     node1 node2 node3
    \   |   /             \   |   /
    -SWITCH--             -SWITCH--
        |                     |
    headnode1             headnode2       NIS master/NFS server
        |                     |                     |
        -------------------SWITCH-------------------------Internet

Any ideas on what could be going wrong, or how to debug this?

Please CC me in replies as I'm not subscribed.

Thanks,

-Adam P.

GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://lyre.mit.edu/~powell/The_Best_Stuff_In_The_World_Today_Cafe.ogg


