
etherchannel bonding problems, Something wicked happened



Hi

I'm setting up a Debian GNU/Linux based cluster, currently with 4 nodes,
each a PPro 200 :( but there may be more/other stuff coming :).
Considering the costs, we settled for Netgear 311 ethernet cards, which
are supported in 2.4.x kernels. There are patches for 2.2.x as well,
but since 2.4 is here... By the way, I'm running unstable on these.

Initially we put 2 ethernet cards in each node, and today was spent
getting bonding to work.
Bonding is supported in late 2.2.x kernels and, of course, in 2.4.x,
but it was a bit tricky to find the correct ifenslave.c to compile and
use.
Once that was done (http://pdsf.nersc.gov/linux/), everything seemed to
work as planned after doing

ifconfig bond0 192.168.1.x netmask 255.255.255.0 up
./ifenslave bond0 eth0
(bond0 gets the MAC address from eth0)
./ifenslave bond0 eth1 

But when I tested the setup by ftping a large file between two nodes,
each configured as above (x=101 and x=103 respectively),
messages of the following type were output repeatedly on the console:

ethX ... Something wicked happened! 0YYY

X was 0 or 1
YYY was one of 500, 700, 740, 749 as far as I can tell

The same thing happened when running NPtcp once the packet size went
above a few kbytes, at speeds of approx. 50 Mbit per second.


I also tested the network cards eth0 to eth0 and eth1 to eth1 in normal
mode (no bonding)
with NPtcp, and both links asymptotically went up to some 89.7 Mbit per
second.
By the way, where did the last 10 Mbit go?
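
As a rough sanity check on that last question, here is a
back-of-the-envelope sketch of how much of a 100 Mbit link is left for
TCP payload once framing is accounted for. The numbers are assumptions,
not measurements from these nodes: MTU 1500, plain IPv4 and TCP headers,
and standard Ethernet per-frame overhead. The function name is just for
illustration.

```python
# Rough upper bound on TCP payload throughput over 100 Mbit/s Ethernet.
# Assumptions (not measured on the cluster): MTU 1500, IPv4 + TCP headers,
# standard Ethernet framing overhead per frame.

LINK_MBIT = 100.0
MTU = 1500                      # bytes of IP packet carried per frame
IP_TCP_HEADERS = 20 + 20        # IPv4 header + TCP header, no options
ETH_OVERHEAD = 8 + 14 + 4 + 12  # preamble/SFD, header, FCS, inter-frame gap


def max_tcp_payload_mbit(tcp_option_bytes=0):
    """Return the payload rate in Mbit/s for full-sized frames."""
    payload = MTU - IP_TCP_HEADERS - tcp_option_bytes
    on_wire = MTU + ETH_OVERHEAD
    return LINK_MBIT * payload / on_wire


if __name__ == "__main__":
    print(f"no TCP options:      {max_tcp_payload_mbit():.1f} Mbit/s")
    # 12 bytes is the usual size of the TCP timestamp option with padding.
    print(f"with TCP timestamps: {max_tcp_payload_mbit(12):.1f} Mbit/s")
```

Under these assumptions framing alone would explain roughly half of the
gap between 89.7 and 100; the sketch says nothing about where the rest
goes (CPU, driver, ACK traffic and so on).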

Does anyone have ideas as to the nature/solution of the "Something
wicked happened" problem?
I did locate the error string in drivers/net/natsemi.c, in the function
netdev_error, but I don't know what to make of it.
Does anyone have experience of this with, for instance, the 3c905, which
in my opinion is very stable?
It is also about three times more expensive, which isn't that much for
one or two cards, although I could imagine substantial savings with the
cheaper card for a large cluster. But if my hours are included ...



Regards,
Anders





PS Some detailed info:

From syslog, identifying the network cards (eth2 is for access from
outside the dedicated networks):

Mar  1 21:30:53 beo101 kernel:   http://www.scyld.com/network/natsemi.html
Mar  1 21:30:53 beo101 kernel:   (unofficial 2.4.x kernel port, version 1.0.3, January 21, 2001 Jeff Garzik, Tjeerd Mulder)
Mar  1 21:30:53 beo101 kernel: eth0: NatSemi DP83815 at 0xc4800000, 00:02:e3:03:da:87, IRQ 12.
Mar  1 21:30:53 beo101 kernel: eth0: Transceiver status 0x7869 advertising 05e1.
Mar  1 21:30:53 beo101 kernel: eth1: NatSemi DP83815 at 0xc4802000, 00:02:e3:03:de:43, IRQ 10.
Mar  1 21:30:53 beo101 kernel: eth1: Transceiver status 0x7869 advertising 05e1.
Mar  1 21:30:53 beo101 kernel: eth2: NatSemi DP83815 at 0xc4804000, 00:02:e3:03:dc:2c, IRQ 11.
Mar  1 21:30:53 beo101 kernel: eth2: Transceiver status 0x7869 advertising 05e1.

Some lines of the wicked message (above them are the two lines logged
for eth0 and eth1 when ifenslave is run):

Mar  1 21:30:56 beo101 /usr/sbin/cron[189]: (CRON) STARTUP (fork ok)
Mar  1 21:35:26 beo101 kernel: eth0: Setting full-duplex based on negotiated link capability.
Mar  1 21:35:32 beo101 ntpd[182]: time reset -0.474569 s
Mar  1 21:35:32 beo101 ntpd[182]: kernel pll status change 41
Mar  1 21:35:32 beo101 ntpd[182]: synchronisation lost
Mar  1 21:35:37 beo101 kernel: eth1: Setting full-duplex based on negotiated link capability.
Mar  1 21:38:01 beo101 /USR/SBIN/CRON[211]: (mail) CMD (  if [ -x /usr/sbin/exim -a -f /etc/exim.conf ]; then /usr/sbin/exim -q >/dev/null 2>&1; fi)
Mar  1 21:39:49 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:04 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:08 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:08 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:12 beo101 last message repeated 2 times
Mar  1 21:40:12 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:13 beo101 last message repeated 2 times
Mar  1 21:40:15 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:16 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:18 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:19 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:19 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:20 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:20 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:21 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 last message repeated 3 times
Mar  1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0500.
Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar  1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0740.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0740.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0500.
Mar  1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0500.
Mar  1 21:40:23 beo101 kernel: eth0: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
Mar  1 21:40:23 beo101 kernel: eth1: Something Wicked happened! 0700.
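
To get an overview of which status codes show up on which interface,
something like the following sketch could tally the messages. It assumes
the syslog line format shown above and treats "last message repeated N
times" as N extra copies of the preceding wicked line. The function name
and the embedded sample are only for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar  1 21:39:49 beo101 kernel: eth1: Something Wicked happened! 0700.
WICKED = re.compile(
    r"kernel: (eth\d+): Something Wicked happened! ([0-9a-fA-F]+)"
)
REPEAT = re.compile(r"last message repeated (\d+) times")

SAMPLE = [
    "Mar  1 21:40:08 beo101 kernel: eth0: Something Wicked happened! 0700.",
    "Mar  1 21:40:12 beo101 last message repeated 2 times",
    "Mar  1 21:40:12 beo101 kernel: eth1: Something Wicked happened! 0700.",
    "Mar  1 21:40:22 beo101 kernel: eth0: Something Wicked happened! 0500.",
]


def tally_wicked(lines):
    """Count (interface, status code) pairs in an iterable of syslog lines."""
    counts = Counter()
    last = None  # most recent wicked (interface, code), if still current
    for line in lines:
        m = WICKED.search(line)
        if m:
            last = (m.group(1), m.group(2))
            counts[last] += 1
            continue
        r = REPEAT.search(line)
        if r and last is not None:
            # syslog collapsed N further copies of the previous message
            counts[last] += int(r.group(1))
        else:
            # some unrelated line; a later "repeated" does not refer to us
            last = None
    return counts


if __name__ == "__main__":
    for (iface, code), n in sorted(tally_wicked(SAMPLE).items()):
        print(f"{iface} {code}: {n}")
```

To run it on a real log, pass an open file instead of SAMPLE, for
example tally_wicked(open(path)) with path pointing at the syslog file.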


The result of ifconfig:

bond0     Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1834429 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:986886789 (941.1 Mb)

eth0      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:907798 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915439 errors:1776 dropped:0 overruns:1776 carrier:1776
          collisions:0 txqueuelen:100 
          RX bytes:435552233 (415.3 Mb)  TX bytes:491795214 (469.0 Mb)
          Interrupt:12 

eth1      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87  
          inet addr:192.168.1.101  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:907768 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915466 errors:1748 dropped:0 overruns:1748 carrier:1748
          collisions:0 txqueuelen:100 
          RX bytes:434992308 (414.8 Mb)  TX bytes:489766183 (467.0 Mb)
          Interrupt:10 Base address:0x2000 

eth2      Link encap:Ethernet  HWaddr 00:02:E3:03:DC:2C  
          inet addr:150.227.64.210  Bcast:150.227.64.255  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:13122 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1182 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:1032660 (1008.4 Kb)  TX bytes:943713 (921.5 Kb)
          Interrupt:11 Base address:0x4000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3904  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:552 (552.0 b)  TX bytes:552 (552.0 b)
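
For comparing the slaves, the TX error counters above can be turned into
a per-interface error rate. The sketch below assumes the ifconfig output
format shown here ("Link encap:" on the first line of each interface,
"TX packets:N errors:M" further down) and defines the rate as errors per
successfully transmitted packet. The function name and the shortened
sample are illustrative only.

```python
import re

IFACE = re.compile(r"^(\S+)\s+Link encap:")
TX = re.compile(r"TX packets:(\d+) errors:(\d+)")

# Shortened sample using the eth0/eth1 counters from the output above.
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87
          RX packets:907798 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915439 errors:1776 dropped:0 overruns:1776
eth1      Link encap:Ethernet  HWaddr 00:02:E3:03:DA:87
          RX packets:907768 errors:0 dropped:0 overruns:0 frame:0
          TX packets:915466 errors:1748 dropped:0 overruns:1748
"""


def tx_error_rates(ifconfig_text):
    """Map interface name to TX errors per transmitted packet."""
    rates = {}
    iface = None
    for line in ifconfig_text.splitlines():
        m = IFACE.match(line)
        if m:
            iface = m.group(1)  # start of a new interface section
            continue
        t = TX.search(line)
        if t and iface is not None:
            packets, errors = int(t.group(1)), int(t.group(2))
            rates[iface] = errors / packets if packets else 0.0
    return rates


if __name__ == "__main__":
    for name, rate in tx_error_rates(SAMPLE).items():
        print(f"{name}: {rate:.2%} TX errors")
```

With the counters above, both slaves come out at roughly 0.19% TX
errors, so the two cards appear to be affected about equally.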


