Bug#538158: BUG: soft lockup - CPU#5 stuck for 62s with 2.6.26-2-686-bigmem kernel
Package: linux-image-2.6.26-2-686-bigmem
Version: 2.6.26-17
Severity: important
This problem is reproducible on two of our Sun X2200 servers (each with two
quad-core Opteron 2376 CPUs and 28GB of RAM). I found a couple of similar bug
reports (#496917 and #536236), but they are filed against amd64 kernels. Ours
is the stock x86 bigmem kernel from Lenny, so I figured I'd file a separate
report.
This is unlikely to be a hardware issue, because it shows up on two different
systems, and each of them ran memtest86+ for several days before deployment.
Right now the machines are running vanilla 2.6.30.1 kernels from kernel.org,
compiled with lenny's config-2.6.26-2-686-bigmem, and the problem is gone.
The problem is that random CPUs intermittently get locked up, with the
following kernel messages showing repeatedly:
...
[48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
[48420.342829] Modules linked in: tcp_diag inet_diag binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 serio_raw shpchp psmouse pci_hotplug i2c_nforce2 pcspkr joydev button i2c_core evdev ext3 jbd mbcache sd_mod usbhid hid ff_memless ide_pci_generic amd74xx ide_core sata_nv ata_generic tg3 libata scsi_mod ehci_hcd ohci_hcd dock usbcore thermal processor fan thermal_sys
[48420.342829]
[48420.342829] Pid: 0, comm: swapper Not tainted (2.6.26-2-686-bigmem #1)
[48420.342829] EIP: 0060:[<c011a124>] EFLAGS: 00000246 CPU: 5
[48420.342829] EIP is at native_safe_halt+0x2/0x3
[48420.342829] EAX: f74be000 EBX: c0107656 ECX: 0f07b000 EDX: 00524d4b
[48420.342829] ESI: 00000005 EDI: 00000000 EBP: 00000000 ESP: f74bffa8
[48420.342829] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[48420.342829] CR0: 8005003b CR2: 080f2c58 CR3: 37585000 CR4: 000006f0
[48420.342829] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[48420.342829] DR6: ffff0ff0 DR7: 00000400
[48420.342829] [<c0107683>] default_idle+0x2d/0x53
[48420.342829] [<c01075ce>] cpu_idle+0xab/0xcb
[48420.342829] =======================
...
The CPU number in the error message (CPU#N) can be anything from 0 to 7, and
the process name in square brackets can be anything from a system process to
a user-run script.
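To see how the lockups are spread across CPUs and processes, the messages can
be tallied from a saved kernel log. The following is only a rough sketch, not
something that was run for this report: the function name summarize_lockups is
made up for illustration, and the regular expression assumes the exact message
format shown in the trace above.

```python
import re
from collections import Counter

# Matches lines such as:
# [48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft-lockup messages per CPU number and per process name.

    Hypothetical helper: takes any iterable of log lines and returns
    two Counters, (per_cpu, per_comm). Non-matching lines are ignored.
    """
    per_cpu = Counter()
    per_comm = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue
        per_cpu[int(match.group("cpu"))] += 1
        per_comm[match.group("comm")] += 1
    return per_cpu, per_comm


# Example using the one message quoted in this report.
example = [
    "[48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]",
    "[48420.342829] EIP is at native_safe_halt+0x2/0x3",
]
cpus, comms = summarize_lockups(example)
print(sorted(cpus.items()))   # [(5, 1)]
print(sorted(comms.items()))  # [('swapper', 1)]
```

Feeding it the lines of a saved dmesg or kern.log file instead of the example
list would show whether any particular CPU or process dominates.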
The machines are pretty much stock Sun X2200 servers with two quad-core
Opteron 2376 CPUs, 28GB of RAM, and one SATA disk. Below is the output
of lspci. Please let me know if you require more information.
trunko:~# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:05.0 VGA compatible controller: ASPEED Technology, Inc. AST2000
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev b5)
06:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
06:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
trunko:~# free
             total       used       free     shared    buffers     cached
Mem:      29120968     954532   28166436          0     304752     274220
-/+ buffers/cache:     375560   28745408
Swap:      2048248          0    2048248
trunko:~# fdisk -l
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          13      104391   83  Linux
/dev/sda2              14       19457   156183930    5  Extended
/dev/sda5   *          14          64      409626   83  Linux
/dev/sda6              65         319     2048256   82  Linux swap / Solaris
/dev/sda7             320        1212     7172991   83  Linux
/dev/sda8            1213        1340     1028128+  83  Linux
/dev/sda9            1341        1468     1028128+  83  Linux
/dev/sda10           1469        2361     7172991   83  Linux
/dev/sda11           2362       19457   137323588+  83  Linux
trunko:~# lsmod
Module Size Used by
binfmt_misc 7020 1
nfsd 207552 9
exportfs 3712 1 nfsd
nfs 220756 10
lockd 57984 2 nfsd,nfs
nfs_acl 2632 2 nfsd,nfs
auth_rpcgss 32752 2 nfsd,nfs
sunrpc 164612 34 nfsd,nfs,lockd,nfs_acl,auth_rpcgss
ipv6 232800 38
ipmi_si 34828 0
ipmi_msghandler 30676 1 ipmi_si
i2c_nforce2 6248 0
joydev 8800 0
serio_raw 4696 0
psmouse 37468 0
shpchp 27108 0
pci_hotplug 24628 1 shpchp
button 5120 0
processor 34600 0
i2c_core 20880 1 i2c_nforce2
pcspkr 2096 0
evdev 8220 3
ext3 107448 7
jbd 41072 1 ext3
mbcache 6984 1 ext3
sd_mod 23924 9
ide_pci_generic 3624 0
amd74xx 5420 0
ide_core 87756 2 ide_pci_generic,amd74xx
usbhid 31452 0
hid 36068 1 usbhid
sata_nv 19636 8
ata_generic 4332 0
tg3 94696 0
libphy 19512 1 tg3
libata 151032 2 sata_nv,ata_generic
scsi_mod 135076 2 sd_mod,libata
ehci_hcd 30492 0
ohci_hcd 19880 0
usbcore 125860 4 usbhid,ehci_hcd,ohci_hcd
thermal 12664 0
fan 4032 0
thermal_sys 13424 3 processor,thermal,fan
-- Package-specific info:
-- System Information:
Debian Release: 5.0.2
APT prefers stable
APT policy: (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 2.6.30.1-i686-bigmem-cdf (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash
Versions of packages linux-image-2.6.26-2-686-bigmem depends on:
ii debconf [debconf-2.0] 1.5.24 Debian configuration management sy
ii initramfs-tools [linux-initra 0.92o tools for generating an initramfs
ii module-init-tools 3.4-1 tools for managing Linux kernel mo
Versions of packages linux-image-2.6.26-2-686-bigmem recommends:
ii libc6-i686 2.7-18 GNU C Library: Shared libraries [i
Versions of packages linux-image-2.6.26-2-686-bigmem suggests:
ii grub 0.97-47lenny2 GRand Unified Bootloader (Legacy v
pn linux-doc-2.6.26 <none> (no description available)
-- debconf information:
linux-image-2.6.26-2-686-bigmem/preinst/overwriting-modules-2.6.26-2-686-bigmem: true
shared/kernel-image/really-run-bootloader: true
linux-image-2.6.26-2-686-bigmem/preinst/lilo-has-ramdisk:
linux-image-2.6.26-2-686-bigmem/postinst/bootloader-test-error-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-2.6.26-2-686-bigmem: false
linux-image-2.6.26-2-686-bigmem/preinst/initrd-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/preinst/abort-overwrite-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/preinst/bootloader-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-initrd-2.6.26-2-686-bigmem: false
linux-image-2.6.26-2-686-bigmem/postinst/create-kimage-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/lilo-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/prerm/would-invalidate-boot-loader-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/failed-to-move-modules-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/prerm/removing-running-kernel-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/old-dir-initrd-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/elilo-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/abort-install-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/old-initrd-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/old-system-map-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/bootloader-error-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/kimage-is-a-directory: