
Re: [OT] How to test a bunch of ethernet cards of a cluster?



On Saturday 06 May 2006 13:05, James He wrote:
> Hi, all
>
> My boss wants me to test a bunch of gigabit ethernet cards of a
> cluster. He kept getting time-out problems when running some MPI jobs
> on the cluster. The problem only happens when the network traffic is
> very high (~100MB/s). Therefore, he wants me to determine which
> ethernet card(s) is/are having the problem when the traffic is high.
> (We don't get any useful information from syslog or the log files of
> MPI jobs.)
>
> I've seen people testing the ethernet card using nc (netcat) -- just
> transfer some files using nc and then compare them. Is there any
> better way to do this, or any suggestions about some existing
> softwares which can automate this, for I have to test a bunch of them?
> Thanks a lot.

use SNMP to monitor dropped packets, bandwidth utilization, etc. for each of 
the machines. (if you haven't used snmp before: you can set up the monitor 
on one machine and have it poll all the others, provided you run the snmp 
daemon on each machine you want to monitor.)

if you have a managed switch, enable snmp on the switch too and monitor it 
as well.
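
a minimal sketch of that kind of central poller, assuming the Net-SNMP 
command line tools (snmpget) are installed on the monitoring machine and 
every node and the switch answer SNMP v2c. the community string "public" 
and the interface index are placeholders, not values from this thread. the 
counters are the standard IF-MIB discard and error counters.

```python
import subprocess

# Standard IF-MIB per-interface counters; the interface index gets appended.
COUNTERS = ("ifInDiscards", "ifOutDiscards", "ifInErrors", "ifOutErrors")


def snmpget_cmd(host, counter, if_index, community="public"):
    """Build the snmpget command line for one counter on one host.

    -Oqv asks snmpget to print only the value, which keeps parsing trivial.
    """
    return ["snmpget", "-v", "2c", "-c", community, "-Oqv",
            host, f"IF-MIB::{counter}.{if_index}"]


def read_counter(host, counter, if_index, community="public"):
    """Run snmpget and return the counter as an int (raises on failure)."""
    result = subprocess.run(snmpget_cmd(host, counter, if_index, community),
                            capture_output=True, text=True,
                            check=True, timeout=10)
    return int(result.stdout.strip())


def poll(hosts, if_index, community="public"):
    """Return {host: {counter: value}} for every host in the list."""
    return {host: {c: read_counter(host, c, if_index, community)
                   for c in COUNTERS}
            for host in hosts}
```

calling poll(["node01", "node02", "switch01"], 2) once a minute and keeping 
the results is enough to see which box's discard counters move when the MPI 
jobs start timing out. the host names and index 2 are made up; snmpwalk on 
IF-MIB::ifDescr shows the real interface indexes.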

this won't test the cards by itself - i can't tell yet what the problem 
is - but it might show you the usage pattern that triggers the timeouts, 
which you can then use to start developing a testing strategy.

'mrtg' is good for tracking bandwidth usage but not much else. i think there 
is a package called 'cacti' that aims to be a more complete snmp monitoring 
tool. the daemon you need comes from the 'net-snmp' project, which used to 
be called 'ucd-snmp'. i think the debian package is called 'snmpd' these days.

i typically write my own client - depending on what parts i want to monitor 
and graph - using python and rrd.
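
a rough sketch of the rrd half of such a client, assuming the rrdtool 
command line program is installed. the file name, the 60 second step and the 
data source name "discards" are illustrative choices only. a COUNTER data 
source makes rrdtool store the per-second rate and deal with counter 
wraparound.

```python
import subprocess


def rrd_create_cmd(path, step=60):
    """Command that creates an RRD with one COUNTER data source.

    Heartbeat is twice the step; the single archive keeps 1440 averaged
    samples (one day of data at a 60 second step).
    """
    return ["rrdtool", "create", path, "--step", str(step),
            f"DS:discards:COUNTER:{2 * step}:0:U",
            "RRA:AVERAGE:0.5:1:1440"]


def rrd_update_cmd(path, value):
    """Command that stores one raw counter reading, timestamped 'now'."""
    return ["rrdtool", "update", path, f"N:{int(value)}"]


def run(cmd):
    """Execute one of the commands above, raising if rrdtool fails."""
    subprocess.run(cmd, check=True)
```

run(rrd_create_cmd("node01.rrd")) is done once per machine, then 
run(rrd_update_cmd("node01.rrd", reading)) after every poll. 'rrdtool graph' 
can draw the result afterwards.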

you might want to pay special attention to the dropped-packet related OIDs 
in the udp and tcp sections. someone has to be dropping packets for it to 
time out. if you find out who (sender/receiver/switch) you might also find 
out why.
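
one way to narrow down the "who", sketched under assumptions: take a 
snapshot of a few counters from every machine before and after a heavy run 
and see whose counters grew. the three OIDs are standard UDP-MIB / TCP-MIB 
objects, but which ones matter here is a guess. the snapshot layout 
({host: {oid: value}}) and the function names are invented for the example.

```python
# Counters worth watching; all three are 32-bit SNMP counters.
WATCHED = ("UDP-MIB::udpInErrors.0",
           "TCP-MIB::tcpRetransSegs.0",
           "TCP-MIB::tcpInErrs.0")


def counter_delta(prev, curr, bits=32):
    """Difference between two readings, allowing for one wraparound."""
    return (curr - prev) % (1 << bits)


def suspects(before, after):
    """Return {host: {oid: increase}} for hosts whose counters grew."""
    report = {}
    for host, now in after.items():
        then = before.get(host, {})
        grew = {}
        for oid in WATCHED:
            if oid in then and oid in now:
                delta = counter_delta(then[oid], now[oid])
                if delta > 0:
                    grew[oid] = delta
        if grew:
            report[host] = grew
    return report
```

a machine whose tcpRetransSegs climbs while its own error counters stay flat 
is probably sending to whoever is actually dropping, so compare both ends 
and the switch ports in between.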

also, tcpdump any ICMP packets. you just might get lucky ;)
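
for example, something like this leaves tcpdump running on a node and 
prints only the ICMP messages that are not plain pings. it assumes tcpdump 
is installed and run as root. the interface name is whatever you pass in, 
and the keyword filter relies on tcpdump's usual "ICMP echo request/reply" 
wording, so treat it as a sketch.

```python
import subprocess

# Ordinary ping traffic, which is not what we are hunting for.
NOISE = ("echo request", "echo reply")


def icmp_capture_cmd(interface):
    """tcpdump command: no name lookups (-n), line-buffered output (-l)."""
    return ["tcpdump", "-n", "-l", "-i", interface, "icmp"]


def is_interesting(line):
    """True for ICMP lines other than echo request/reply."""
    return "ICMP" in line and not any(word in line for word in NOISE)


def watch(interface):
    """Print interesting ICMP packets until tcpdump exits."""
    proc = subprocess.Popen(icmp_capture_cmd(interface),
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if is_interesting(line):
            print(line.rstrip())
```

watch("eth0") would be the call on a typical node. unreachable or source 
quench messages showing up during a heavy MPI run are the kind of lucky hit 
to look for.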

hope that helps.

>
> --
> Best regards,
>
> James He

-- 

anoop aryal
aaryal@foresightint.com


