
Bug#538158: marked as done (BUG: soft lockup - CPU#5 stuck for 62s with 2.6.26-2-686-bigmem kernel)



Your message dated Thu, 5 Jan 2012 09:59:32 -0600
with message-id <20120105155932.GC10774@elie.hsd1.il.comcast.net>
and subject line Re: BUG: soft lockup - CPU#5 stuck for 62s with 2.6.26-2-686-bigmem kernel
has caused the Debian Bug report #538158,
regarding BUG: soft lockup - CPU#5 stuck for 62s with 2.6.26-2-686-bigmem kernel
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
538158: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538158
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: linux-image-2.6.26-2-686-bigmem
Version: 2.6.26-17
Severity: important

This problem is repeatable on two of our Sun X2200 servers (two quad-core
Opteron 2376 CPUs and 28GB of RAM).  I found a couple of similar bug reports
(#496917 and #536236), but they are filed against amd64 kernels.  Ours is
the stock x86 bigmem kernel out of Lenny, so I figured I'd file
a separate report.

This is unlikely to be a hardware issue, because it shows up on two different 
systems.  Each of them had memtest86+ running for several days before 
deployment.  Right now the machines are running vanilla 2.6.30.1
kernels from kernel.org, compiled with lenny's config-2.6.26-2-686-bigmem,
and the problem is gone.

The problem is that random CPUs intermittently get locked up, with the 
following kernel messages showing repeatedly:

...
[48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
[48420.342829] Modules linked in: tcp_diag inet_diag binfmt_misc nfsd
lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 serio_raw shpchp
psmouse pci_hotplug i2c_nforce2 pcspkr joydev button i2c_core evdev
ext3 jbd mbcache sd_mod usbhid hid ff_memless ide_pci_generic amd74xx
ide_core sata_nv ata_generic tg3 libata scsi_mod ehci_hcd ohci_hcd
dock usbcore thermal processor fan thermal_sys
[48420.342829]
[48420.342829] Pid: 0, comm: swapper Not tainted (2.6.26-2-686-bigmem #1)
[48420.342829] EIP: 0060:[<c011a124>] EFLAGS: 00000246 CPU: 5
[48420.342829] EIP is at native_safe_halt+0x2/0x3
[48420.342829] EAX: f74be000 EBX: c0107656 ECX: 0f07b000 EDX: 00524d4b
[48420.342829] ESI: 00000005 EDI: 00000000 EBP: 00000000 ESP: f74bffa8
[48420.342829]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[48420.342829] CR0: 8005003b CR2: 080f2c58 CR3: 37585000 CR4: 000006f0
[48420.342829] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[48420.342829] DR6: ffff0ff0 DR7: 00000400
[48420.342829]  [<c0107683>] default_idle+0x2d/0x53
[48420.342829]  [<c01075ce>] cpu_idle+0xab/0xcb
[48420.342829]  =======================
...

The CPU#N part of the error message can be anything from 0 to 7, and
the process name in square brackets can be anything from a system
process to a user-run script.
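
To check how the lockups are spread across CPUs and processes, the
"BUG: soft lockup" lines can be tallied from a saved kernel log.  The
following is only a minimal sketch in Python: the function name
summarize_lockups and the regular expression are illustrative
assumptions based on the single message format quoted above, and other
kernel versions may word the message differently.

```python
import re
from collections import Counter

# Matches lines of the form quoted in this report, e.g.:
# [48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]
# (Assumed format; adjust the pattern if your kernel prints it differently.)
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft-lockup reports per CPU number and per process name."""
    by_cpu = Counter()
    by_comm = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            by_cpu[int(match.group("cpu"))] += 1
            by_comm[match.group("comm")] += 1
    return by_cpu, by_comm


# Example using the one message quoted in this report.
sample = ["[48420.342829] BUG: soft lockup - CPU#5 stuck for 62s! [swapper:0]"]
by_cpu, by_comm = summarize_lockups(sample)
```

The same function can be fed the lines of any file holding kernel
messages; a skew toward one CPU or one process name would point
somewhere different than an even spread.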

The machines are pretty much stock Sun X2200 servers with two quad-core 
Opteron 2376 CPUs, 28GB of RAM, and one SATA disk.  Below is the output
of lspci.  Please let me know if you require more information.

trunko:~# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:05.0 VGA compatible controller: ASPEED Technology, Inc. AST2000
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev b5)
06:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
06:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)

trunko:~# free
             total       used       free     shared    buffers     cached
Mem:      29120968     954532   28166436          0     304752     274220
-/+ buffers/cache:     375560   28745408
Swap:      2048248          0    2048248

trunko:~# fdisk -l

Disk /dev/sda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          13      104391   83  Linux
/dev/sda2              14       19457   156183930    5  Extended
/dev/sda5   *          14          64      409626   83  Linux
/dev/sda6              65         319     2048256   82  Linux swap / Solaris
/dev/sda7             320        1212     7172991   83  Linux
/dev/sda8            1213        1340     1028128+  83  Linux
/dev/sda9            1341        1468     1028128+  83  Linux
/dev/sda10           1469        2361     7172991   83  Linux
/dev/sda11           2362       19457   137323588+  83  Linux

trunko:~# lsmod
Module                  Size  Used by
binfmt_misc             7020  1 
nfsd                  207552  9 
exportfs                3712  1 nfsd
nfs                   220756  10 
lockd                  57984  2 nfsd,nfs
nfs_acl                 2632  2 nfsd,nfs
auth_rpcgss            32752  2 nfsd,nfs
sunrpc                164612  34 nfsd,nfs,lockd,nfs_acl,auth_rpcgss
ipv6                  232800  38 
ipmi_si                34828  0 
ipmi_msghandler        30676  1 ipmi_si
i2c_nforce2             6248  0 
joydev                  8800  0 
serio_raw               4696  0 
psmouse                37468  0 
shpchp                 27108  0 
pci_hotplug            24628  1 shpchp
button                  5120  0 
processor              34600  0 
i2c_core               20880  1 i2c_nforce2
pcspkr                  2096  0 
evdev                   8220  3 
ext3                  107448  7 
jbd                    41072  1 ext3
mbcache                 6984  1 ext3
sd_mod                 23924  9 
ide_pci_generic         3624  0 
amd74xx                 5420  0 
ide_core               87756  2 ide_pci_generic,amd74xx
usbhid                 31452  0 
hid                    36068  1 usbhid
sata_nv                19636  8 
ata_generic             4332  0 
tg3                    94696  0 
libphy                 19512  1 tg3
libata                151032  2 sata_nv,ata_generic
scsi_mod              135076  2 sd_mod,libata
ehci_hcd               30492  0 
ohci_hcd               19880  0 
usbcore               125860  4 usbhid,ehci_hcd,ohci_hcd
thermal                12664  0 
fan                     4032  0 
thermal_sys            13424  3 processor,thermal,fan


-- Package-specific info:

-- System Information:
Debian Release: 5.0.2
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.30.1-i686-bigmem-cdf (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash

Versions of packages linux-image-2.6.26-2-686-bigmem depends on:
ii  debconf [debconf-2.0]         1.5.24     Debian configuration management sy
ii  initramfs-tools [linux-initra 0.92o      tools for generating an initramfs
ii  module-init-tools             3.4-1      tools for managing Linux kernel mo

Versions of packages linux-image-2.6.26-2-686-bigmem recommends:
ii  libc6-i686                    2.7-18     GNU C Library: Shared libraries [i

Versions of packages linux-image-2.6.26-2-686-bigmem suggests:
ii  grub                       0.97-47lenny2 GRand Unified Bootloader (Legacy v
pn  linux-doc-2.6.26           <none>        (no description available)

-- debconf information:
  linux-image-2.6.26-2-686-bigmem/preinst/overwriting-modules-2.6.26-2-686-bigmem: true
  shared/kernel-image/really-run-bootloader: true
  linux-image-2.6.26-2-686-bigmem/preinst/lilo-has-ramdisk:
  linux-image-2.6.26-2-686-bigmem/postinst/bootloader-test-error-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-2.6.26-2-686-bigmem: false
  linux-image-2.6.26-2-686-bigmem/preinst/initrd-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/preinst/abort-overwrite-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/preinst/bootloader-initrd-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/postinst/depmod-error-initrd-2.6.26-2-686-bigmem: false
  linux-image-2.6.26-2-686-bigmem/postinst/create-kimage-link-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/preinst/lilo-initrd-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/prerm/would-invalidate-boot-loader-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/preinst/failed-to-move-modules-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/prerm/removing-running-kernel-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/postinst/old-dir-initrd-link-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/preinst/elilo-initrd-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/preinst/abort-install-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/postinst/old-initrd-link-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/postinst/old-system-map-link-2.6.26-2-686-bigmem: true
  linux-image-2.6.26-2-686-bigmem/postinst/bootloader-error-2.6.26-2-686-bigmem:
  linux-image-2.6.26-2-686-bigmem/postinst/kimage-is-a-directory:



--- End Message ---
--- Begin Message ---
Arcady Genkin wrote:

> We have not seen this bug in a very long while now.  I can't tell for
> sure, but it feels like at least a year.
>
> The servers are currently running linux-image-2.6-686-bigmem
> 2.6.26+17+lenny1 kernel.
>
> That said, these servers are always quite busy, even during the
> holidays, so if the bug is related to prolonged idle periods, as was
> hypothesized before, then there is no wonder it is not happening.

Closing, but please let us know if you get another chance to try on an
idle machine.

Many thanks,
Jonathan


--- End Message ---