Bug#572201: linux-image-2.6.32-trunk-amd64: forcedeth driver hangs under heavy load



So I changed all 45 nodes in the cluster back to the 2.6.32 kernel and
restarted a test job. After 15-20 minutes, some of the nodes had dropped
out: they still responded to pings, but it was impossible to ssh to them.

Output from one node follows (collected after connecting to its console).

Ben Hutchings wrote:
> We'll want to see the kernel log (output from dmesg) after this happens,
> even if you can't spot anything in it.

The problem first started at 08:56. As you can see, the last messages in
/var/log/kern.log are from 08:31 (just after booting); the output from
dmesg is identical.

Mar  9 08:31:11 node20 kernel: [    9.102710]  sdc1
Mar  9 08:31:11 node20 kernel: [    9.102890] sd 2:0:0:0: [sdc] Attached SCSI disk
Mar  9 08:31:11 node20 kernel: [    9.104582]  sda1 sda2 < sda5 sda6 >
Mar  9 08:31:11 node20 kernel: [    9.131101] sd 0:0:0:0: [sda] Attached SCSI disk
Mar  9 08:31:11 node20 kernel: [    9.133794] sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar  9 08:31:11 node20 kernel: [    9.133824] sd 1:0:0:0: Attached scsi generic sg1 type 0
Mar  9 08:31:11 node20 kernel: [    9.133849] sd 2:0:0:0: Attached scsi generic sg2 type 0
Mar  9 08:31:11 node20 kernel: [    9.133875] sd 3:0:0:0: Attached scsi generic sg3 type 0
Mar  9 08:31:11 node20 kernel: [    9.133970] sr 4:0:0:0: Attached scsi generic sg4 type 5
Mar  9 08:31:11 node20 kernel: [    9.324705] PM: Starting manual resume from disk
Mar  9 08:31:11 node20 kernel: [    9.339131] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
Mar  9 08:31:11 node20 kernel: [    9.339135] EXT4-fs (sda1): write access will be enabled during recovery
Mar  9 08:31:11 node20 kernel: [   10.429924] EXT4-fs (sda1): recovery complete
Mar  9 08:31:11 node20 kernel: [   10.430834] EXT4-fs (sda1): mounted filesystem with ordered data mode
Mar  9 08:31:11 node20 kernel: [   11.227223] udev: starting version 151
Mar  9 08:31:11 node20 kernel: [   11.447716] processor LNXCPU:00: registered as cooling_device0
Mar  9 08:31:11 node20 kernel: [   11.448024] processor LNXCPU:01: registered as cooling_device1
Mar  9 08:31:11 node20 kernel: [   11.448329] processor LNXCPU:02: registered as cooling_device2
Mar  9 08:31:11 node20 kernel: [   11.448646] processor LNXCPU:03: registered as cooling_device3
Mar  9 08:31:11 node20 kernel: [   11.448950] processor LNXCPU:04: registered as cooling_device4
Mar  9 08:31:11 node20 kernel: [   11.449253] processor LNXCPU:05: registered as cooling_device5
Mar  9 08:31:11 node20 kernel: [   11.449557] processor LNXCPU:06: registered as cooling_device6
Mar  9 08:31:11 node20 kernel: [   11.449858] processor LNXCPU:07: registered as cooling_device7
Mar  9 08:31:11 node20 kernel: [   11.584225] i2c i2c-0: nForce2 SMBus adapter at 0x2d00
Mar  9 08:31:11 node20 kernel: [   11.584244] i2c i2c-1: nForce2 SMBus adapter at 0x2e00
Mar  9 08:31:11 node20 kernel: [   11.614078] input: PC Speaker as /devices/platform/pcspkr/input/input5
Mar  9 08:31:11 node20 kernel: [   11.699803] EDAC MC: Ver: 2.1.0 Jan 10 2010
Mar  9 08:31:11 node20 kernel: [   11.826247] EDAC amd64_edac: Ver: 3.2.0 Jan 10 2010
Mar  9 08:31:11 node20 kernel: [   11.826765] Error: Driver 'pcspkr' is already registered, aborting...
Mar  9 08:31:11 node20 kernel: [   11.826812] EDAC amd64: ECC is enabled by BIOS.
Mar  9 08:31:11 node20 kernel: [   11.826992] EDAC amd64: ECC is enabled by BIOS.
Mar  9 08:31:11 node20 kernel: [   11.827029] EDAC MC: F10h CPU detected
Mar  9 08:31:11 node20 kernel: [   11.827095] EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
Mar  9 08:31:11 node20 kernel: [   11.827098] EDAC MC: F10h CPU detected
Mar  9 08:31:11 node20 kernel: [   11.827157] EDAC MC1: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:19.2
Mar  9 08:31:11 node20 kernel: [   11.827174] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
Mar  9 08:31:11 node20 kernel: [   12.124910] Adding 32170120k swap on /dev/sda6. Priority:-1 extents:1 across:32170120k
Mar  9 08:31:11 node20 kernel: [   12.338321] loop: module loaded
Mar  9 08:31:11 node20 kernel: [   14.407171] EXT4-fs (sda5): mounted filesystem with ordered data mode
Mar  9 08:31:11 node20 kernel: [   15.002385] EXT4-fs (sdb1): recovery complete
Mar  9 08:31:11 node20 kernel: [   15.002786] EXT4-fs (sdb1): mounted filesystem with ordered data mode
Mar  9 08:31:11 node20 kernel: [   15.568677] EXT4-fs (sdc1): recovery complete
Mar  9 08:31:11 node20 kernel: [   15.570225] EXT4-fs (sdc1): mounted filesystem with ordered data mode
Mar  9 08:31:11 node20 kernel: [   16.180705] EXT4-fs (sdd1): recovery complete
Mar  9 08:31:11 node20 kernel: [   16.180909] EXT4-fs (sdd1): mounted filesystem with ordered data mode
Mar  9 08:31:11 node20 kernel: [   16.834881] alloc irq_desc for 30 on node 0
Mar  9 08:31:11 node20 kernel: [   16.834885]   alloc kstat_irqs on node 0
Mar  9 08:31:11 node20 kernel: [   16.834895] forcedeth 0000:00:08.0: irq 30 for MSI/MSI-X
Mar  9 08:31:21 node20 kernel: [   27.456017] eth0: no IPv6 routers present
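
With 45 nodes to check, it may help to confirm mechanically that the kernel log went quiet well before the hang. The following is only a rough sketch, not something that was run on the cluster: it assumes the syslog line layout shown above, an English locale for the month name, and the year 2010 (the log lines themselves carry no year). The function name `last_kernel_entry` is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Mar  9 08:31:21 node20 kernel: [   27.456017] eth0: no IPv6 routers present
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s?(?P<msg>.*)$"
)


def last_kernel_entry(lines, year=2010):
    """Return (wall-clock time, uptime seconds, message) of the last kernel line.

    `year` is an assumption: syslog timestamps do not include one.
    Returns None if no line matches.
    """
    last = None
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        # Collapse the padding in "Mar  9" so strptime sees single spaces.
        stamp = " ".join(m.group("stamp").split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        last = (when, float(m.group("uptime")), m.group("msg"))
    return last


sample = [
    "Mar  9 08:31:11 node20 kernel: [   16.834895] forcedeth 0000:00:08.0: irq 30 for MSI/MSI-X",
    "Mar  9 08:31:21 node20 kernel: [   27.456017] eth0: no IPv6 routers present",
]
when, uptime, msg = last_kernel_entry(sample)
# Seconds between the last kernel message and the reported start of the hang.
gap = (datetime(2010, 3, 9, 8, 56) - when).total_seconds()
```

On a node, the same function could be fed the lines of /var/log/kern.log; a large gap with no forcedeth messages would match what is reported here.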



> The device statistics (output from ethtool -S eth0) might also be
> informative.

NIC statistics:
     tx_bytes: 63756006188
     tx_zero_rexmt: 56365619
     tx_one_rexmt: 0
     tx_many_rexmt: 0
     tx_late_collision: 0
     tx_fifo_errors: 0
     tx_carrier_errors: 0
     tx_excess_deferral: 0
     tx_retry_error: 0
     rx_frame_error: 0
     rx_extra_byte: 0
     rx_late_collision: 0
     rx_runt: 0
     rx_frame_too_long: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_align_error: 0
     rx_length_error: 0
     rx_unicast: 58975439
     rx_multicast: 933
     rx_broadcast: 1618
     rx_packets: 58977990
     rx_errors_total: 0
     tx_errors_total: 0
     tx_deferral: 0
     tx_packets: 56365619
     rx_bytes: 69269122814
     tx_pause: 0
     rx_pause: 46798
     rx_drop_frame: 46798
     tx_unicast: 2284
     tx_multicast: 3008
     tx_broadcast: 16510200339
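
For comparing these counters across nodes, a small parser may be easier than reading the output by eye. This is a minimal sketch under stated assumptions: it expects the "name: value" layout printed by ethtool -S as shown above, and the keyword list used to pick out interesting counters is an arbitrary choice for illustration, not anything defined by the forcedeth driver. The function names are hypothetical.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header, which has no numeric value.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def suspicious_counters(stats, keywords=("error", "drop", "pause")):
    """Return the nonzero counters whose name contains one of `keywords`.

    The keyword list is an assumption made for this sketch.
    """
    return {
        name: value
        for name, value in stats.items()
        if value and any(word in name for word in keywords)
    }


# A few lines copied from the statistics above.
SAMPLE = """NIC statistics:
     rx_packets: 58977990
     rx_errors_total: 0
     tx_errors_total: 0
     tx_pause: 0
     rx_pause: 46798
     rx_drop_frame: 46798
"""

stats = parse_ethtool_stats(SAMPLE)
flagged = suspicious_counters(stats)
```

Applied to the full output above, the only counters this filter flags are rx_pause and rx_drop_frame, both at 46798; every error counter is zero.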

If I run ifdown eth0 and then ifup eth0, I can connect to the system again without problems.

Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com



