
p4_error



Hello,

I have a small 8-node PC cluster here, and we are getting MPICH error
messages that look like this one:

p7_13798:  p4_error: socket_recv_on_fd: invalid data type %d 
: 6

This error occurs quite randomly, once every 10-40 hours of computation
time. The same software runs well on another cluster, so I suspected
hardware first. We tried swapping some nodes (the error seems to occur
randomly on all nodes) and even the master computer, with no success.
The last suspect hardware components are the cables and the hub, but
I'm not sure how to test those for such a random error. I managed to
find some old references to the same error here:

http://www.beowulf.org/pipermail/beowulf/2001-August/000957.html

which hints that the problem might be with MPICH itself. We use
MPICH 1.2.2 here, which is one of the few packages we took directly
from upstream rather than from Debian. If I remember correctly, the
reason is that we had problems getting Debian's mpich to work with the
PGI Fortran compiler (which is the one we have to use here).
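
For the cables and hub, one thing that could be tried is watching the
per-interface error and collision counters on each node and seeing
whether they grow around the time of a failure. Below is only a rough
sketch of that idea, assuming the nodes run Linux and expose the usual
/proc/net/dev table (8 receive columns followed by 8 transmit columns).
The function names parse_net_dev and nonzero_error_counters are made up
for this example and are not part of MPICH or any other package.

```python
# Column names follow the Linux /proc/net/dev layout:
# 8 receive counters, then 8 transmit counters per interface.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

# Counters that may point at cabling or hub trouble when they are nonzero.
ERROR_KEYS = ["rx_errs", "rx_drop", "rx_fifo", "rx_frame",
              "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier"]


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        values = rest.split()
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        numbers = [int(v) for v in values]
        counters = {}
        for field, value in zip(RX_FIELDS, numbers[:8]):
            counters["rx_" + field] = value
        for field, value in zip(TX_FIELDS, numbers[8:]):
            counters["tx_" + field] = value
        stats[name.strip()] = counters
    return stats


def nonzero_error_counters(stats):
    """Keep only interfaces that have at least one nonzero error counter."""
    report = {}
    for name, counters in stats.items():
        bad = {k: counters[k] for k in ERROR_KEYS if counters[k] != 0}
        if bad:
            report[name] = bad
    return report


def check_node(path="/proc/net/dev"):
    """Read the counters on the local node and return the suspicious ones."""
    with open(path) as handle:
        return nonzero_error_counters(parse_net_dev(handle.read()))
```

Calling check_node() periodically on every node (for example from cron)
and comparing successive results would show whether frame errors or
collisions increase on a particular link. Clean counters would not rule
out the hardware, but growing ones would be a strong hint.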

I would be happy to hear any ideas about where the problem could be,
what else to check, or whether someone else has already seen this
error ...

Pavel


