
Bug#490156: Info received (Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s')



Hi,

Just to add a little more information: I've seen this bug again on an
identical set of hardware, running an identical (Debian preseed
installed) copy of Debian, also on the (now previous) version of the
testing kernel: 2.6.24-1-686. I've attached the log from the IPMI serial
console.

One thing that I might not have made completely clear last time (sorry
about this, if so) is that the 'BUG: soft lockup...' lines all relate to
the bonding driver:

(previous crash)
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6EBX: f77a7bf8 ECX:
00000000 EDX: f8c6428e
 [<c0103e5e>] sysenter_past_esp+BUG: soft lockup - CPU#3 stuck for 11s!
[ebr3:2823]
 [<c0255e05>] sys_socketcall+0x204/0x26BUG: soft lockup - CPU#3 stuck
for 11s! [ebr3:2823]
 [BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [<c02BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [<c025460b>] sys_sBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
 [<c025460b>] sys_setsockopt+0xBUG: soft lockup - CPU#3 stuck for 11s!
[ebr3:2823]
BUG: soft lockup - CPU#0 stuck for 11s! [ospf6d:3647]
 [<c0255e05>] sys_sockBUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
BUG: soft lockup -__write_lock_failed+0x9/0x1c

On the above machine, 'ebr3' and 'etrA' are both bonded interfaces:
$ grep ^ /sys/class/net/{ebr3,etrA}/bonding/*
/sys/class/net/ebr3/bonding/ad_actor_key:17
/sys/class/net/ebr3/bonding/ad_aggregator:1
/sys/class/net/ebr3/bonding/ad_num_ports:2
/sys/class/net/ebr3/bonding/ad_partner_key:291
/sys/class/net/ebr3/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/ebr3/bonding/arp_interval:0
/sys/class/net/ebr3/bonding/arp_validate:none 0
/sys/class/net/ebr3/bonding/downdelay:0
Binary file /sys/class/net/ebr3/bonding/fail_over_mac matches
/sys/class/net/ebr3/bonding/lacp_rate:slow 0
/sys/class/net/ebr3/bonding/miimon:100
/sys/class/net/ebr3/bonding/mii_status:up
/sys/class/net/ebr3/bonding/mode:802.3ad 4
/sys/class/net/ebr3/bonding/slaves:etbA etbC
/sys/class/net/ebr3/bonding/updelay:0
/sys/class/net/ebr3/bonding/use_carrier:1
/sys/class/net/ebr3/bonding/xmit_hash_policy:layer2 0
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:290
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:a4:b3:2b:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0

(current crash)
BUG: soft lockup - CPU#1 stuck +0xf/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 st/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
 [<c025460b>BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft l_lock_failed+0x9/0x1c
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
 [<c025460b>] sys_setsocBUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
 [<c0135455>] autBUG: soft lockup - CPU#1 stuck for 11s! [ospfd:6839]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
BUG: soft lockup - CPU#1 stuck for f6279bf8 EBX: f6279bf8 ECX: 00000000
EDX: f8d1828e
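
Because the serial console output is interleaved and truncated, it is
easier to see which tasks are stuck by counting only the complete
reports. The following is a minimal sketch, not something I ran to
produce the logs above; the function name tally_lockups is made up for
illustration, and it assumes the report format stays exactly as shown
in the excerpts (including line wraps between words):

```python
import re
from collections import Counter

# Matches complete reports such as:
#   BUG: soft lockup - CPU#3 stuck for 11s! [etrA:4443]
# Whitespace between words may be a line wrap. Fragments that were cut
# off before the "[task:pid]" part do not match and are skipped.
LOCKUP_RE = re.compile(
    r"BUG:\s+soft\s+lockup\s+-\s+CPU#(\d+)\s+stuck\s+for\s+(\d+)s!"
    r"\s*\[([^\]:\s]+):(\d+)\]"
)


def tally_lockups(text):
    """Count complete soft-lockup reports per (cpu, task, pid)."""
    counts = Counter()
    for cpu, _secs, task, pid in LOCKUP_RE.findall(text):
        counts[(int(cpu), task, int(pid))] += 1
    return counts
```

Feeding it the saved console log (for example
tally_lockups(open("console.log").read()), where console.log is a
placeholder file name) gives a per-task count that makes it clear
whether anything other than the bond threads and the routing daemons
is being reported.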

On the above machine etrA is a bonded interface:
$ grep ^ /sys/class/net/etrA/bonding/*
/sys/class/net/etrA/bonding/ad_actor_key:17
/sys/class/net/etrA/bonding/ad_aggregator:1
/sys/class/net/etrA/bonding/ad_num_ports:2
/sys/class/net/etrA/bonding/ad_partner_key:292
/sys/class/net/etrA/bonding/ad_partner_mac:00:17:08:ca:6a:00
/sys/class/net/etrA/bonding/arp_interval:0
/sys/class/net/etrA/bonding/arp_validate:none 0
/sys/class/net/etrA/bonding/downdelay:0
Binary file /sys/class/net/etrA/bonding/fail_over_mac matches
/sys/class/net/etrA/bonding/lacp_rate:slow 0
/sys/class/net/etrA/bonding/miimon:100
/sys/class/net/etrA/bonding/mii_status:up
/sys/class/net/etrA/bonding/mode:802.3ad 4
/sys/class/net/etrA/bonding/slaves:etbB etbD
/sys/class/net/etrA/bonding/updelay:0
/sys/class/net/etrA/bonding/use_carrier:1
/sys/class/net/etrA/bonding/xmit_hash_policy:layer2 0
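
For comparing the bonding settings across machines, the same sysfs
attributes can be collected programmatically. This is only a sketch:
read_bonding_attrs is a hypothetical helper name, and the NUL stripping
is based on my assumption that grep reports fail_over_mac as a binary
file because the attribute contains a stray NUL byte.

```python
from pathlib import Path


def read_bonding_attrs(iface, sysfs_root="/sys/class/net"):
    """Return {attribute: value} from <sysfs_root>/<iface>/bonding/."""
    attrs = {}
    bonding_dir = Path(sysfs_root) / iface / "bonding"
    for entry in sorted(bonding_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            raw = entry.read_bytes()
        except OSError:
            continue  # skip attributes that cannot be read
        # Drop NUL bytes (assumed cause of grep's "Binary file" message)
        # and surrounding whitespace so every value comes out as text.
        text = raw.replace(b"\0", b"").decode("ascii", "replace")
        attrs[entry.name] = text.strip()
    return attrs
```

Calling read_bonding_attrs("etrA") on each machine and comparing the
resulting dictionaries would show differences such as the
ad_partner_key and ad_partner_mac values above.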

I note an interesting exchange in the Ubuntu bug tracker, concerning
Ubuntu 8.04 server with a 2.6.24 kernel:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/245779

which contains this interesting information:

  * Does not affect kernel 2.6.24-18
  * Bug introduced with 2.6.24-19
  * BIOS update fixes the problem for (at least some) people.

The Ubuntu changelog between these versions can be viewed here:

http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_2.6.24-19.36/changelog

My hardware isn't the same as that of the other people in that bug
report (I've got a Viglen branded Intel chassis with an Intel Server
Board SE7520JR2 and dual Intel Xeon 3.00GHz), although it might well be
similar enough for me to have the same problem. My BIOS is up to date.

I'm still not exactly clear what the bug is (is it in the bonding
module, or something more general?) and am therefore unsure which
version of the kernel I ought to aim for on my other machines (another
12 of them). If anyone has any clues to steer my research in the right
direction I would be most grateful.

If not, I'm intending to move them cautiously across to a 2.6.26 kernel
to see what happens; unfortunately this process is a little slow.

Best wishes,

Simon

