
Bug#301153: libc6: Occasional EPERM during first fread() following popen()



Package: libc6
Version: 2.3.2.ds1-20

Greetings,

I have an MPI program which does a popen and fread, something like:

      /* snprintf returns the length it wanted to write, so a result
         >= the buffer size means the command was truncated */
      if (snprintf (filename, 999, "gunzip -c < %s.cpu%.4d.data",
                    basename, rank) >= 999)
        return 1;
      if (!(infile = popen (filename, "r")))
        return 1;
      if (ferror (infile))
        {
          printf ("[%d] Pipe open has error %d\n", rank, ferror (infile));
          fflush (stdout);
        }
      ... some stuff ...
      nmemb = fread (globalarray, sizeof (PetscScalar), gridpoints * dof, infile);
      if (nmemb != gridpoints * dof)
        {
          /* ferror() returns a nonzero flag, not an errno value */
          printf ("[%d] ferror = %d\n", rank, ferror (infile));
          fflush (stdout);
        }
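
One way to narrow this down, offered only as a sketch: ferror() reports
whether the stream's error indicator is set (it returns nonzero, not an
errno value), so the number printed above cannot by itself tell EPERM
apart from any other failure, or from a short read at end of input.  The
helper below clears errno before the fread, saves it immediately
afterwards, and checks ferror() to classify the result.  The name
read_items and the read_status enum are hypothetical and not part of the
program above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper, not from the original program. */
enum read_status { READ_OK = 0, READ_EOF = 1, READ_ERROR = 2 };

/* Read `want` items of `size` bytes from `fp` into `buf`.
   Stores the number of complete items read in *got and the errno value
   seen right after fread() in *saved_errno, then reports why the read
   stopped. */
static enum read_status
read_items (void *buf, size_t size, size_t want, FILE *fp,
            size_t *got, int *saved_errno)
{
  size_t n;

  assert (fp != NULL && got != NULL && saved_errno != NULL);

  errno = 0;                    /* so a stale value is not misread */
  n = fread (buf, size, want, fp);
  *saved_errno = errno;         /* capture before any other call */
  *got = n;

  if (n == want)
    return READ_OK;
  if (ferror (fp))
    return READ_ERROR;          /* error indicator set; see *saved_errno */
  return READ_EOF;              /* short count, no error: input ended */
}
```

In the program above, this could stand in for the bare fread call, with
the diagnostic printing the status, the item count, and
strerror (saved_errno) (from <string.h>) alongside the rank.  A
READ_EOF result with a short count would point at gunzip producing less
output than expected rather than at a failing read.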

The popen itself reports no error, but on between one and five CPUs out
of about 20, the fread results in an EPERM error.  I see this on two
clusters; on the second, the error is less frequent but still there.
Both are identically configured Debian Beowulf clusters using the
diskless package and mpich.  The one with fewer errors is made of dual
Athlon XP 1.53 GHz boxes; the one with more errors is made of dual
Opteron 240 boxes running stock Debian -k7-smp kernels and a 32-bit
userland.

On the other hand, the same program earlier fopen()s a file whose path
and name are identical to the popen redirected input except for the
extension, and that works flawlessly.

machines.LINUX on the starting node (say node2) in both cases looks
something like:
node2
node2
node3
node3
etc.

Authentication is via NIS, whose master server (and NFS server for the
files in question) is outside of the "subnet" of these clusters,
something like:

node1 node2 node3     node1 node2 node3
    \   |   /             \   |   /
    -SWITCH--             -SWITCH--
        |                     |
    headnode1             headnode2       NIS master/NFS server
        |                     |                     |
        -------------------SWITCH-------------------------Internet

This problem has since been corroborated by Sergio Visinoni who sees the
error sometimes when using fread() after popen() in a multithreaded
program.

Any ideas on what could be going wrong, or even how to debug this, would
be appreciated.

Thanks,
-Adam
-- 
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html


