
Re: [OT] How to test a bunch of ethernet cards of a cluster?



On Thu, May 18, 2006 at 10:11:34AM -0500, anoop aryal wrote:
> On Saturday 06 May 2006 13:05, James He wrote:
> > Hi, all
> >
> > My boss wants me to test a bunch of gigabit ethernet cards of a
> > cluster. He kept getting time-out problems when running some MPI jobs
> > on the cluster. The problem only happens when the network traffic is
> > very high (~100MB/s). Therefore, he wants me to determine which
> > ethernet card(s) is/are having the problem when the traffic is high.
> > (We don't get any useful information from syslog or the log files of
> > MPI jobs.)
> >
> > I've seen people testing the ethernet card using nc (netcat) -- just
> > transfer some files using nc and then compare them. Is there any
> > better way to do this, or any suggestions about existing software
> > that can automate it, since I have to test a bunch of them?
> > Thanks a lot.
> 
> use SNMP to monitor dropped packets, bandwidth utilization etc. for each one 
> of the machines. (if you haven't used snmp before, you can set up the monitor 
> on one machine to monitor all your machines provided you set up the snmp 
> daemon on each machine you want to monitor)
> 
> if you have a managed switch, set up the snmp daemon on the switch too and 
> monitor it as well.
> 
> this is not going to help you test it - since i can't tell what may be the 
> problem yet - but might give you enough clues as to a pattern of usage that 
> causes this which you can use to start developing a testing strategy.
> 
> 'mrtg' is good for tracking bandwidth usage but not much else. i think there 
> is a package called 'cacti' that aims to be a more complete snmp monitoring 
> package. the daemon you need is 'snmpd', from the 'net-snmp' project (which 
> used to be called 'ucd-snmp').
> 
> i typically write my own client - depending on what parts i want to monitor 
> and graph - using python and rrd.
> 
> you might want to pay special attention to the dropped packet related OIDs in 
> either the udp or tcp sections. someone must be dropping packets for it to 
> time out. if you find out who (sender/receiver/switch) you might also find out 
> why.
> 
> also, tcpdump any ICMP packets. you just might get lucky ;)
> 
> hope that helps.
> 
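
For the "transfer with nc and compare" idea in the original question, here
is a rough Python sketch of the same thing: push random data over a TCP
connection and compare SHA-256 digests on both ends. The function names
(receive_digest, send_payload, loopback_check) are made up for
illustration, and the self-check only runs over 127.0.0.1. On a real
cluster the receiver would presumably run on a different node, with its
address passed to send_payload. TCP already carries its own checksum, so a
mismatch should be rare; a stalled or failed transfer under load is
probably the more useful signal.

```python
import hashlib
import os
import socket
import threading


def receive_digest(server_sock, result):
    """Accept one connection, hash everything received, store the digest."""
    conn, _ = server_sock.accept()
    digest = hashlib.sha256()
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # sender closed the connection
                break
            digest.update(chunk)
    result["digest"] = digest.hexdigest()


def send_payload(host, port, payload):
    """Send the payload to host:port and return its SHA-256 hex digest."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
    return hashlib.sha256(payload).hexdigest()


def loopback_check(size=1 << 20):
    """Send `size` random bytes over 127.0.0.1; True if the digests match."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=receive_digest, args=(server, result))
    receiver.start()
    sent_digest = send_payload("127.0.0.1", port, os.urandom(size))
    receiver.join()
    server.close()
    return sent_digest == result.get("digest")
```

Calling loopback_check() only proves the script itself works. To exercise
a particular card, the two halves would have to run on separate machines
with a payload large enough to keep the link busy.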

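The advice above about watching dropped-packet counters can be tried
without setting up SNMP first. This is a different technique from the one
suggested: a minimal sketch, assuming Linux nodes, that reads the
per-interface error and drop counters from /proc/net/dev and reports which
ones changed over an interval. The function names (parse_net_dev,
counter_deltas, sample_interval) are hypothetical.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev text into {iface: {counter_name: value}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:  # expect 8 receive + 8 transmit columns
            continue
        values = [int(field) for field in fields]
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    deltas = {}
    for iface, now in after.items():
        if iface not in before:
            continue
        changed = {
            key: now[key] - before[iface][key]
            for key in now
            if now[key] != before[iface][key]
        }
        if changed:
            deltas[iface] = changed
    return deltas


def sample_interval(seconds, path="/proc/net/dev"):
    """Snapshot the counters, wait, snapshot again, return the deltas."""
    with open(path) as handle:
        before = parse_net_dev(handle.read())
    time.sleep(seconds)
    with open(path) as handle:
        after = parse_net_dev(handle.read())
    return counter_deltas(before, after)
```

Running sample_interval(60) on every node while an MPI job is loading the
network, and comparing the results, might show whether one machine stands
out. It says nothing about drops inside the switch.
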
Reading this response provoked a thought: 

The people in the open software world who know most about the
performance of ethernet hardware are probably involved in the Beowulf
project ( beowulf.org ) and, in particular, Donald Becker. If I were
charged with testing some ethernet cards, I would be sure to
understand their work in great detail before doing any touch labor of
setting up my own testing program.

I remember having seen evaluations of various ethernet cards based on
actual testing, but I can't recall any details of the actual test
setups and procedures. But surely, those details are available if you
ask.

Beyond that, hardware testing usually involves some special test setup
in which one can change the operating environment in a controlled way.
One popular control is the ability to change the supply voltage to
higher and lower than the nominal. This was called 'running margins'
long ago when I had some involvement in hardware testing.

-- 
Paul E Condon           
pecondon@mesanetworks.net
