network reliability testing

For the last few days I have been getting random errors in MPICH
communication on my cluster, with a typical error message looking
like this:

p2_2983:  p4_error: socket_recv_on_fd: invalid data type %d

Since such messages started appearing only recently, quite randomly and
after several hours of computation, I guess that the cause is a hardware
problem. I ran some tests (such as prime95) on all the nodes but have
found nothing so far. The most suspect component is now the switch, or
maybe the cables. Does anyone know of a tool that could put heavy load
on my internal cluster network and test whether communication is OK?
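
To make the request concrete, the kind of test meant here can be
sketched in a few lines of Python, assuming Python is available on the
nodes. This is only an illustration, not an existing tool: the function
names (recv_exact, serve_echo, run_client, loopback_selftest) and the
default sizes are made up for the example. One side echoes every byte
back; the other sends random blocks and counts the blocks that come
back changed.

```python
import os
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def serve_echo(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            conn.sendall(data)


def run_client(host, port, rounds=1000, block_size=16384):
    """Send random blocks, read each one back, and count corrupted blocks.

    block_size is kept small (an assumption of this sketch) so that a whole
    block fits in the socket buffers and the simple send-then-receive loop
    cannot stall.
    """
    mismatches = 0
    with socket.create_connection((host, port)) as sock:
        for _ in range(rounds):
            block = os.urandom(block_size)
            sock.sendall(block)
            if recv_exact(sock, block_size) != block:
                mismatches += 1
    return mismatches


def loopback_selftest(rounds=20, block_size=16384):
    """Run server and client on 127.0.0.1 to check the script itself."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_echo, args=(listener,), daemon=True)
    server.start()
    try:
        return run_client("127.0.0.1", port, rounds, block_size)
    finally:
        server.join(timeout=5)
        listener.close()
```

For a real test, serve_echo would run on one node and run_client on
another, with a large number of rounds and several node pairs active at
once so that the switch is actually loaded. The loopback self-test only
confirms that the script works; it does not exercise the cables or the
switch.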

Pavel Jurus
