Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
the system is now reaching an uptime that was hardly possible before I
set "nowatchdog":
netfinity5000:~# uptime
13:43:12 up 10 days, 20:43, 2 users, load average: 0,01, 0,08, 0,07
Once we reach 14 days or more, we can be sure that it really is the
watchdog/NMI "feature" that causes these SMP systems to lock up,
intermittently but quite reliably, after an uptime of 1 to 8 days.
To avoid any side effects while testing, I changed nothing on the
system except this kernel boot parameter after the last lockup 10 days
ago: no software updates and no other changes to the kernel. (This
means the current kernel produced at least one "successful" lockup, as
I had tried various configurations and versions before the hint about
the NMI/watchdog issue got my full attention.)
After months of frustration, I have quite a detailed impression of this
misbehaviour, and nothing has ever made me feel as confident in
restored reliability as setting this boot parameter.
Here is my current interrupt state:
netfinity5000:~# cat /proc/interrupts
           CPU0       CPU1
  0:         49          0   IO-APIC-edge     timer
  1:          3          0   IO-APIC-edge     i8042
  6:          3          0   IO-APIC-edge     floppy
  7:          1          0   IO-APIC-edge     parport0
  8:          0          0   IO-APIC-edge     rtc0
  9:          0          0   IO-APIC-fasteoi  acpi
 12:          1          3   IO-APIC-edge     i8042
 14:         42         74   IO-APIC-edge     ata_generic
 15:          0          0   IO-APIC-edge     ata_generic
 16:         49         48   IO-APIC-fasteoi  aic7xxx, aic7xxx
 17:   19391683   19362804   IO-APIC-fasteoi  eth0
 18:     649647     660452   IO-APIC-fasteoi  megaraid, ohci_hcd:usb2
 19:    8761472    8704241   IO-APIC-fasteoi  eth1
 22:   11804557   11924853   IO-APIC-fasteoi  ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:          1          1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          2          0   APIC ICR read retries
RES:    1628056    1619691   Rescheduling interrupts
CAL:     293382     396292   Function call interrupts
TLB:     211292     194994   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       3129       3129   Machine check polls
ERR:          0
MIS:          0
Here are my boot parameters and the reboot date since which the system
has been running flawlessly:
Jun 20 17:01:49 netfinity5000 kernel: [ 0.000000] Kernel command
line: auto BOOT_IMAGE=Linux ro
root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8
aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq
libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog
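For anyone who wants to check the same thing on their own machine, here
is a minimal sketch of how to confirm that the parameter is active on
the running kernel. The helper names has_boot_flag and read_or_none are
made up for this example. The two sysctl files under /proc/sys/kernel
are an assumption: they should exist only on kernels built with the
lockup detector, so the sketch prints None where they are missing.

```python
from pathlib import Path


def has_boot_flag(cmdline, flag="nowatchdog"):
    """Return True if `flag` appears as a separate word on the kernel command line."""
    return flag in cmdline.split()


def read_or_none(path):
    """Return the stripped contents of `path`, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    cmdline = read_or_none("/proc/cmdline") or ""
    print("nowatchdog on command line:", has_boot_flag(cmdline))

    # Assumption: these files are present only if the lockup detector is built in.
    for name in ("watchdog", "nmi_watchdog"):
        print(name, "=", read_or_none("/proc/sys/kernel/" + name))
```

The flag is matched as a whole word, so a parameter such as
nmi_watchdog=0 is not mistaken for nowatchdog.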
Just for comparison: before this, reboots/lockups occurred on June 4th,
June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.
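To put these dates next to the current uptime, the gaps between them
can be worked out with a few lines of Python. This is only a sketch of
the arithmetic: it uses the day of the month directly, which works here
because all listed dates fall in June.

```python
# Days of June on which a reboot/lockup is listed above.
lockup_days = [4, 6, 7, 8, 11, 13, 15, 20]

# Gap in days between each event and the next one.
gaps = [later - earlier for earlier, later in zip(lockup_days, lockup_days[1:])]

print("gaps between events (days):", gaps)  # [2, 1, 1, 3, 2, 2, 5]
print("shortest:", min(gaps), "longest:", max(gaps))  # shortest: 1 longest: 5
```

The longest gap in this list is 5 days, so the uptime of more than 10
days shown above is already more than twice as long.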
If you need more information like a full kernel boot log or whatever,
just ask me.
Thanks and best regards,
Hans-Juergen