
Bug#301153: marked as done (libc6: Occasional EPERM during first fread() following popen())



Your message dated Mon, 22 May 2006 16:46:30 +0200
with message-id <20060522144630.GA7231@henry.aurel32.net>
and subject line Bug#301153: libc6: Occasional EPERM during first fread() following popen()
has caused the attached Bug report to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what I am
talking about this indicates a serious mail system misconfiguration
somewhere.  Please contact me immediately.)

Debian bug tracking system administrator
(administrator, Debian Bugs database)

--- Begin Message ---
Package: libc6
Version: 2.3.2.ds1-20

Greetings,

I have an MPI program which does a popen and fread, something like:

      /* snprintf returns the untruncated length, so >= 999 is the
         truncation test for a 999-byte buffer */
      if (snprintf (filename, 999, "gunzip -c < %s.cpu%.4d.data",
                    basename, rank) >= 999)
        return 1;
      if (!(infile = popen (filename, "r")))
        return 1;
      if (ferror (infile))
      {
          printf ("[%d] Pipe open has error %d\n", rank, ferror (infile));
          fflush (stdout);
      }
      ... some stuff ...
      nmemb = fread (globalarray, sizeof (PetscScalar), gridpoints * dof, infile);
      if (nmemb != gridpoints * dof)
      {
          printf ("[%d] ferror = %d\n", rank, ferror (infile));
          fflush (stdout);
      }

So, there seems to be no error in the popen, but on between one and five
CPUs out of about 20, the fread results in an EPERM error.  On the other
cluster, the error is less frequent but still there.  They are both
identically configured Debian beowulfs using the diskless package and
mpich; the one with fewer errors is made of dual AthlonXP 1.53 GHz
boxes and the one with more errors of dual Opteron 240 boxes, both
running stock Debian -k7-smp kernels and a 32-bit userland.

On the other hand, the same program earlier fopen()s a file whose path
and name are identical to the popen redirected input except for the
extension, and those reads work flawlessly.

machines.LINUX on the starting node (say node2) in both cases looks
something like:
node2
node2
node3
node3
etc.

Authentication is via NIS, whose master server (and NFS server for the
files in question) is outside of the "subnet" of these clusters,
something like:

node1 node2 node3     node1 node2 node3
    \   |   /             \   |   /
    -SWITCH--             -SWITCH--
        |                     |
    headnode1             headnode2       NIS master/NFS server
        |                     |                     |
        -------------------SWITCH-------------------------Internet

This problem has since been corroborated by Sergio Visinoni who sees the
error sometimes when using fread() after popen() in a multithreaded
program.

Any ideas on what could be going wrong, or even how to debug this, would
be appreciated.

Thanks,
-Adam
-- 
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html


--- End Message ---
--- Begin Message ---
On Sat, Apr 16, 2005 at 09:32:05PM +0900, GOTO Masanori wrote:
> At Fri, 25 Mar 2005 13:52:59 +0900,
> GOTO Masanori wrote:
> > I think this problem should be separated from MPI and clusters.  This
> > kind of random behavior is usually caused by an invalid access.  I
> > recommend that you first check your program with valgrind, then
> > isolate the problem from MPI.
> 
> Does this problem still occur with your program?  If we get no
> reply to this bug, I'll close it...
> 

No reply in one year, so closing it with this mail.


-- 
  .''`.  Aurelien Jarno	            | GPG: 1024D/F1BCDB73
 : :' :  Debian GNU/Linux developer | Electrical Engineer
 `. `'   aurel32@debian.org         | aurelien@aurel32.net
   `-    people.debian.org/~aurel32 | www.aurel32.net

--- End Message ---
