
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38



Hello,

The system is now reaching an uptime that was hardly possible before I set "nowatchdog":

netfinity5000:~# uptime
 13:43:12 up 10 days, 20:43,  2 users,  load average: 0,01, 0,08, 0,07

Once we reach 14 days or more, we can be confident that it really is the watchdog/NMI "feature" that causes these SMP systems to lock up, at irregular intervals but reliably, after an uptime of 1 to 8 days.

To avoid side effects while testing, the only thing I changed after the last lockup 10 days ago was this kernel boot parameter: no software updates and no other changes to the kernel. This means the current kernel has produced at least one "successful" lockup, since I had tried various configurations and versions before the hint about the NMI/watchdog issue got my full attention.

After months of frustration, I have a fairly detailed picture of this misbehaviour, and nothing has ever made me as confident that reliability is restored as setting this boot parameter.

Here is my current interrupt state:

netfinity5000:~# cat /proc/interrupts
           CPU0       CPU1
  0:         49          0   IO-APIC-edge      timer
  1:          3          0   IO-APIC-edge      i8042
  6:          3          0   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          1          3   IO-APIC-edge      i8042
 14:         42         74   IO-APIC-edge      ata_generic
 15:          0          0   IO-APIC-edge      ata_generic
 16:         49         48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:   19391683   19362804   IO-APIC-fasteoi   eth0
 18:     649647     660452   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:    8761472    8704241   IO-APIC-fasteoi   eth1
 22:   11804557   11924853   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:          1          1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          2          0   APIC ICR read retries
RES:    1628056    1619691   Rescheduling interrupts
CAL:     293382     396292   Function call interrupts
TLB:     211292     194994   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       3129       3129   Machine check polls
ERR:          0
MIS:          0
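
The NMI row above shows only one NMI per CPU after more than ten days. For anyone who wants to log that row over time instead of reading it by eye, here is a minimal Python sketch. The helper name `nmi_counts` is hypothetical, and the sample text is a shortened copy of the output above; it is an illustration, not a tested monitoring tool.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Leading numeric fields are per-CPU counts; the rest is the description.
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI row found


# Shortened sample taken from the output above.
sample = """\
           CPU0       CPU1
NMI:          1          1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
"""
print(nmi_counts(sample))  # [1, 1]
```

On a live system the same function could be fed the contents of /proc/interrupts, for example `nmi_counts(open("/proc/interrupts").read())`.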

Here are my boot parameters and the reboot date since which the system has been running flawlessly:

Jun 20 17:01:49 netfinity5000 kernel: [ 0.000000] Kernel command line: auto BOOT_IMAGE=Linux ro root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog
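
To double-check that the parameter really reached the kernel, the command line can be tested for the flag. This is only a rough sketch: the helper name `has_boot_flag` is hypothetical, and the string is copied from the log line above.

```python
def has_boot_flag(cmdline, flag):
    """True if `flag` appears as a standalone parameter on the kernel command line."""
    return flag in cmdline.split()


# Command line copied from the kernel log line above.
cmdline = ("auto BOOT_IMAGE=Linux ro root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 "
           "aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq "
           "libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog")
print(has_boot_flag(cmdline, "nowatchdog"))  # True
```

On the running machine the string could be read from /proc/cmdline instead of being pasted in.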

Just for comparison: before that reboot, reboots/lockups occurred on June 4th, June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.
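
The gaps between those dates can be worked out with a few lines of Python. This is a small sketch; the year 2012 is an assumption (the list gives only day and month) and has no effect on the gaps within June.

```python
from datetime import date

# Lockup/reboot dates from the list above; the year 2012 is an assumption
# and does not change the day differences.
lockups = [date(2012, 6, d) for d in (4, 6, 7, 8, 11, 13, 15, 20)]

# Days between consecutive lockups.
gaps = [(b - a).days for a, b in zip(lockups, lockups[1:])]
print(gaps)       # [2, 1, 1, 3, 2, 2, 5]
print(max(gaps))  # 5
```

So the longest stretch between two lockups in that list was five days, compared with the current uptime of more than ten days.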


If you need more information like a full kernel boot log or whatever, just ask me.


Thanks and best regards,

Hans-Juergen




