
Bug#490156: linux-image-2.6.24-1-686: SMP (2*hyperthreading xeon) machine wedged in loop saying 'BUG: soft lockup - CPU#N stuck for 11s'



On Thu, Jul 10, 2008 at 11:57:36AM +0100, Simon A. Boggis wrote:
> Package: linux-image-2.6.24-1-686
> Version: 2.6.24-7
> Severity: critical
> Justification: breaks the whole system

Inflated severity; learn to set it properly.
One or two broken boxes doesn't mean the kernel is unusable on the
whole, but everybody likes to play selfish: "oh, my bug is that important".
 
 
> I have a number of dual processor xeon machines (hyperthreading cores, Intel SR2400 chassis), giving
> four logical processors thus:
> 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 15
> model		: 4
> model name	: Intel(R) Xeon(TM) CPU 3.00GHz
> stepping	: 1
> cpu MHz		: 2992.689
> cache size	: 1024 KB
> physical id	: 0
> siblings	: 2
> core id		: 0
> cpu cores	: 1
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 5
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips	: 5989.95
> clflush size	: 64
> 
> processor	: 1
> vendor_id	: GenuineIntel
> cpu family	: 15
> model		: 4
> model name	: Intel(R) Xeon(TM) CPU 3.00GHz
> stepping	: 1
> cpu MHz		: 2992.689
> cache size	: 1024 KB
> physical id	: 0
> siblings	: 2
> core id		: 0
> cpu cores	: 1
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 5
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips	: 5985.43
> clflush size	: 64
> 
> processor	: 2
> vendor_id	: GenuineIntel
> cpu family	: 15
> model		: 4
> model name	: Intel(R) Xeon(TM) CPU 3.00GHz
> stepping	: 1
> cpu MHz		: 2992.689
> cache size	: 1024 KB
> physical id	: 3
> siblings	: 2
> core id		: 0
> cpu cores	: 1
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 5
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips	: 5985.49
> clflush size	: 64
> 
> processor	: 3
> vendor_id	: GenuineIntel
> cpu family	: 15
> model		: 4
> model name	: Intel(R) Xeon(TM) CPU 3.00GHz
> stepping	: 1
> cpu MHz		: 2992.689
> cache size	: 1024 KB
> physical id	: 3
> siblings	: 2
> core id		: 0
> cpu cores	: 1
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 5
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl cid cx16 xtpr
> bogomips	: 5985.49
> clflush size	: 64
> 
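The listing above shows two physical packages (physical id 0 and 3), each exposing two hyperthreaded siblings. A minimal sketch of how that grouping could be read out of /proc/cpuinfo-style text, assuming the key names shown in the listing; the function name `summarize_topology` and the shortened `SAMPLE` text are illustrative only:

```python
from collections import defaultdict

# Shortened sample in /proc/cpuinfo format (only the two keys used below).
SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\n\n"
    "processor\t: 1\nphysical id\t: 0\n\n"
    "processor\t: 2\nphysical id\t: 3\n\n"
    "processor\t: 3\nphysical id\t: 3\n"
)


def summarize_topology(cpuinfo_text):
    """Map each 'physical id' value to the list of logical processor numbers."""
    packages = defaultdict(list)
    current = {}
    # The trailing "" makes sure the last stanza is flushed as well.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one processor stanza.
            if "processor" in current:
                package = current.get("physical id", "unknown")
                packages[package].append(int(current["processor"]))
            current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    return dict(packages)


for package, cpus in sorted(summarize_topology(SAMPLE).items()):
    print(f"physical id {package}: logical processors {cpus}")
```

On a live machine the same function could be fed the contents of /proc/cpuinfo instead of the sample string.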
> The machines run Debian stable with an apt-pinned Debian testing (lenny) kernel for some newer features (mainly iptables state
> tracking).
> 
> I'm running the machines as firewall/routers, and because of this I'm using LACP bonding to
> create two logical 2 gigabit interfaces, each composed of:
> 
>   1 onboard plus one PCI-X e1000
> 
> Today I saw one of my machines disappear off the network at 0935 - as it disappeared our HP 5400 switch reported an LACP
> error:
>   I 07/10/08 09:35:08 00393 lacp: Port F1 is blocked - error condition
> 
> The machine didn't recover over the course of 20 minutes - when I finally got into the serial console using the onboard IPMI
> management controller I could see that it was stuck in a loop producing the following messages. I wasn't able to get any
> kind of response from it other than this:
> 
> SOL Session operational.  Use ?? for help
> BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
> 
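When collecting such console output from several machines, it can help to pull the CPU number, stall duration, task name and PID out of each message. A minimal sketch, assuming the message format is exactly as quoted above; the names `SOFT_LOCKUP_RE` and `parse_soft_lockup` are hypothetical:

```python
import re

# Pattern modelled on the quoted line:
#   BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup message, or None if the line does not match."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "seconds": int(match["seconds"]),
        "task": match["task"],
        "pid": int(match["pid"]),
    }


print(parse_soft_lockup("BUG: soft lockup - CPU#3 stuck for 11s! [ebr3:2823]"))
```

The pattern uses `search` rather than `match`, so quoted or timestamp-prefixed lines are also accepted.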

Try out the newer 2.6.26-rc9 snapshots; see the trunk apt lines at
-> http://wiki.debian.org/DebianKernel

-- 
maks


