--- Begin Message ---
- To: "Debian Bug Tracking System" <submit@bugs.debian.org>
- Subject: BUG: soft lockup - CPU#5 stuck for 62s with 2.6.26-2-686-bigmem kernel
- From: "Arcady Genkin" <agenkin@cdf.toronto.edu>
- Date: 23 Jul 2009 12:33:46 -0400
- Message-id: <20090723163346.26626.36611.reportbug@trunko>
Package: linux-image-2.6.26-2-686-bigmem
Version: 2.6.26-17
Severity: important
This problem is repeatable on two of our Sun X2200 servers (two quad-core
Opteron 2376 CPUs and 28GB of RAM). I found a couple of similar bug reports
(#496917 and #536236), but they are filed against amd64 kernels. Ours is
the stock x86 bigmem kernel out of Lenny, so I figured I'd file
a separate report.
This is unlikely to be a hardware issue, because it shows up on two different
systems. Each of them had memtest86+ running for several days before
deployment. Right now the machines are running vanilla 2.6.30.1
kernels from kernel.org, compiled with lenny's config-2.6.26-2-686-bigmem,
and the problem is gone.
The problem is that random CPUs intermittently get locked up, with the
following kernel messages showing repeatedly:
...
[48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
[48420.342829] Modules linked in: tcp_diag inet_diag binfmt_misc nfsd
lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 serio_raw shpchp
psmouse pci_hotplug i2c_nforce2 pcspkr joydev button i2c_core evdev
ext3 jbd mbcache sd_mod usbhid hid ff_memless ide_pci_generic amd74xx
ide_core sata_nv ata_generic tg3 libata scsi_mod ehci_hcd ohci_hcd
dock usbcore thermal processor fan thermal_sys
[48420.342829]
[48420.342829] Pid: 0, comm: swapper Not tainted (2.6.26-2-686-bigmem #1)
[48420.342829] EIP: 0060:[<c011a124>] EFLAGS: 00000246 CPU: 5
[48420.342829] EIP is at native_safe_halt+0x2/0x3
[48420.342829] EAX: f74be000 EBX: c0107656 ECX: 0f07b000 EDX: 00524d4b
[48420.342829] ESI: 00000005 EDI: 00000000 EBP: 00000000 ESP: f74bffa8
[48420.342829] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[48420.342829] CR0: 8005003b CR2: 080f2c58 CR3: 37585000 CR4: 000006f0
[48420.342829] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[48420.342829] DR6: ffff0ff0 DR7: 00000400
[48420.342829] [<c0107683>] default_idle+0x2d/0x53
[48420.342829] [<c01075ce>] cpu_idle+0xab/0xcb
[48420.342829] =======================
...
The CPU#N part of the error message can be anything from 0 to 7, and
the process name in square brackets can be anything from a system
process to a user-run script.
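To check whether the lockups really are spread evenly across CPUs and
processes, the "BUG: soft lockup" lines can be tallied from a saved kernel
log. What follows is only a minimal sketch: the function name
tally_lockups and the regular expression are illustrative assumptions
modelled on the message format quoted above, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed message shape, taken from the log excerpt in this report:
# [48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def tally_lockups(lines):
    """Count soft-lockup reports per CPU number and per process name.

    `lines` is any iterable of log lines (hypothetical helper, not a
    kernel or Debian interface). Lines that do not match are skipped.
    """
    by_cpu = Counter()
    by_comm = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue
        by_cpu[int(match.group("cpu"))] += 1
        by_comm[match.group("comm")] += 1
    return by_cpu, by_comm
```

It could be fed an open log file, for example
tally_lockups(open("/var/log/kern.log")), assuming the messages were
written there. Counts skewed toward one CPU or one process name would
point at something more specific than the random pattern described above.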
The machines are pretty much stock Sun X2200 servers with two quad-core
Opteron 2376 CPUs, 28GB of RAM, and one SATA disk. Below is the output
of lspci. Please let me know if you require more information.
trunko:~# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:05.0 VGA compatible controller: ASPEED Technology, Inc. AST2000
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev b5)
06:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
06:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
trunko:~# free
             total       used       free     shared    buffers     cached
Mem:      29120968     954532   28166436          0     304752     274220
-/+ buffers/cache:     375560   28745408
Swap:      2048248          0    2048248
trunko:~# fdisk -l
Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000
Device     Boot   Start         End      Blocks   Id  System
/dev/sda1             1          13      104391   83  Linux
/dev/sda2            14       19457   156183930    5  Extended
/dev/sda5  *         14          64      409626   83  Linux
/dev/sda6            65         319     2048256   82  Linux swap / Solaris
/dev/sda7           320        1212     7172991   83  Linux
/dev/sda8          1213        1340     1028128+  83  Linux
/dev/sda9          1341        1468     1028128+  83  Linux
/dev/sda10         1469        2361     7172991   83  Linux
/dev/sda11         2362       19457   137323588+  83  Linux
trunko:~# lsmod
Module Size Used by
binfmt_misc 7020 1
nfsd 207552 9
exportfs 3712 1 nfsd
nfs 220756 10
lockd 57984 2 nfsd,nfs
nfs_acl 2632 2 nfsd,nfs
auth_rpcgss 32752 2 nfsd,nfs
sunrpc 164612 34 nfsd,nfs,lockd,nfs_acl,auth_rpcgss
ipv6 232800 38
ipmi_si 34828 0
ipmi_msghandler 30676 1 ipmi_si
i2c_nforce2 6248 0
joydev 8800 0
serio_raw 4696 0
psmouse 37468 0
shpchp 27108 0
pci_hotplug 24628 1 shpchp
button 5120 0
processor 34600 0
i2c_core 20880 1 i2c_nforce2
pcspkr 2096 0
evdev 8220 3
ext3 107448 7
jbd 41072 1 ext3
mbcache 6984 1 ext3
sd_mod 23924 9
ide_pci_generic 3624 0
amd74xx 5420 0
ide_core 87756 2 ide_pci_generic,amd74xx
usbhid 31452 0
hid 36068 1 usbhid
sata_nv 19636 8
ata_generic 4332 0
tg3 94696 0
libphy 19512 1 tg3
libata 151032 2 sata_nv,ata_generic
scsi_mod 135076 2 sd_mod,libata
ehci_hcd 30492 0
ohci_hcd 19880 0
usbcore 125860 4 usbhid,ehci_hcd,ohci_hcd
thermal 12664 0
fan 4032 0
thermal_sys 13424 3 processor,thermal,fan
-- Package-specific info:
-- System Information:
Debian Release: 5.0.2
APT prefers stable
APT policy: (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 2.6.30.1-i686-bigmem-cdf (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash
Versions of packages linux-image-2.6.26-2-686-bigmem depends on:
ii debconf [debconf-2.0] 1.5.24 Debian configuration management sy
ii initramfs-tools [linux-initra 0.92o tools for generating an initramfs
ii module-init-tools 3.4-1 tools for managing Linux kernel mo
Versions of packages linux-image-2.6.26-2-686-bigmem recommends:
ii libc6-i686 2.7-18 GNU C Library: Shared libraries [i
Versions of packages linux-image-2.6.26-2-686-bigmem suggests:
ii grub 0.97-47lenny2 GRand Unified Bootloader (Legacy v
pn linux-doc-2.6.26 <none> (no description available)
-- debconf information:
linux-image-2.6.26-2-686-bigmem/preinst/overwriting-modules-2.6.26-2-686-bigmem: true
shared/kernel-image/really-run-bootloader: true
linux-image-2.6.26-2-686-bigmem/preinst/lilo-has-ramdisk:
linux-image-2.6.26-2-686-bigmem/postinst/bootloader-test-error-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-2.6.26-2-686-bigmem: false
linux-image-2.6.26-2-686-bigmem/preinst/initrd-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/preinst/abort-overwrite-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/preinst/bootloader-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-initrd-2.6.26-2-686-bigmem: false
linux-image-2.6.26-2-686-bigmem/postinst/create-kimage-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/lilo-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/prerm/would-invalidate-boot-loader-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/failed-to-move-modules-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/prerm/removing-running-kernel-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/old-dir-initrd-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/elilo-initrd-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/preinst/abort-install-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/old-initrd-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/old-system-map-link-2.6.26-2-686-bigmem: true
linux-image-2.6.26-2-686-bigmem/postinst/bootloader-error-2.6.26-2-686-bigmem:
linux-image-2.6.26-2-686-bigmem/postinst/kimage-is-a-directory:
--- End Message ---
--- Begin Message ---
Arcady Genkin wrote:
> We have not seen this bug in a very long while now. I can't tell for
> sure, but it feels like at least a year.
>
> The servers are currently running linux-image-2.6-686-bigmem
> 2.6.26+17+lenny1 kernel.
>
> That said, these servers are always quite busy, even during the
> holidays, so if the bug is related to prolonged idle periods, as was
> hypothesized before, then there is no wonder it is not happening.
Closing, but please let us know if you get another chance to try on an
idle machine.
Many thanks,
Jonathan
--- End Message ---