Bug#639331: linux-image-2.6.36-rc6-686-bigmem: Closing laptop lid hangs the system on Dell studio 1555



Hello!

I am very glad to have found this bug report, because it is possible that the NMI watchdog mechanism has been giving me serious headaches ever since Debian kernel 2.6.38 was released! I cannot say so definitively yet: in my case the error is intermittent and may take up to a week to appear once, and I only disabled the NMI watchdog mechanism by adding the "nowatchdog" boot parameter a few hours ago, when I came across this bug report.

A short summary of my problem:

- alongside several uniprocessor systems running Debian and Ubuntu, I run several older multiprocessor servers (IBM Netfinity 5000 (dual P3), IBM Netfinity 7000 M10 (quad P3 Xeon) and IBM xSeries 232 (dual P3 Tualatin)) with Debian (using testing as a "rolling release" after a long time on lenny)

- the systems were running rock-solid up to and including the Debian-packaged kernel 2.6.32

- when the Debian-packaged kernel 2.6.38 came out, my problem started. It appeared mainly on the Netfinity 5000 (but, less often, on the other systems too): after running continuously for one to eight days, the system suddenly locked up hard, and in most cases it was simply idle when this happened

- this lockup was a classic livelock, which is easy to diagnose on these IBM machines because they have an activity LED for each CPU: the LEDs glowed with identical brightness and without any modulation, so both CPUs were alternating with each other in short cycles

- when comparing the basic system data and properties, I noticed a difference between kernels 2.6.32 and 2.6.38: the latter showed a continuously rising NMI count on each CPU, which was not seen with 2.6.32! Today I know where these NMIs come from: it is the watchdog mechanism that also causes your laptop problem

- I hoped that the problem might disappear with kernel 3.4, as there had been a few discussions on LKML about several livelocks/deadlocks related to timers and the like (the config change that enabled the "lockup detector" between 2.6.32 and 2.6.38 had gone unnoticed by me)

- as you see on the laptop, this lockup NEVER lets any message get out via the debugging mechanisms, not even when a serial cable is attached and the console output is logged on a second machine

- now, with kernel 3.4.2, the problem still exists, but its consequences have changed a bit: instead of a livelock, it is a deadlock in most cases and activity stays on a single CPU; sometimes it even causes a reboot instead of staying locked up

- I described the problem on a German forum, but nobody could point me to this lockup-detector change in the kernel config, even though I had posted the significant change from "no NMIs" to "continuous NMIs". Here we see again how poorly the documentation of open-source projects is sometimes maintained... Even when configuring a kernel, the config help says that the NMI watchdog has to be enabled deliberately with a boot parameter; in fact it seems to be activated by default as soon as SMP code is loaded and/or an APIC is detected (although, despite the presence of an APIC, I have not yet seen those NMIs on my uniprocessor P3 machines).
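
For anyone who wants to check for the same symptom (a continuously rising NMI count on each CPU), here is a minimal sketch in Python. It assumes the usual layout of /proc/interrupts, where a line starting with "NMI:" carries one counter per CPU; the function names are only my own illustration and not part of any kernel tool.

```python
import time


def parse_nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts-style text.

    Assumption: the NMI line looks like
        NMI:   100   200   Non-maskable interrupts
    Returns an empty list if no such line is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the trailing description starts here
                counts.append(int(field))
            return counts
    return []


def nmi_deltas(before, after):
    """Per-CPU increase between two lists of NMI counters."""
    return [b - a for a, b in zip(before, after)]


def sample_nmi_deltas(interval_seconds, path="/proc/interrupts"):
    """Read the counters twice, interval_seconds apart, and return the increase."""
    with open(path) as f:
        first = parse_nmi_counts(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        second = parse_nmi_counts(f.read())
    return nmi_deltas(first, second)
```

Calling sample_nmi_deltas(10) on the machine in question should return all zeros if no NMIs are being generated, and positive numbers on every CPU if the watchdog is producing them as described above.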

Here is a link to my description on the German "debianforum": http://debianforum.de/forum/viewtopic.php?f=33&t=134210

I would like to report the bug at http://bugzilla.kernel.org if no one else has done so yet. It would therefore be great if you could let me know whether you have already reported it.

Basically, I think this mechanism has bugs and/or makes wrong assumptions on some machines and should undergo a critical review. I wonder whether there are more people in the world being troubled by strange lockups of their machines which are wrongly diagnosed as "hardware errors" etc.

Hope to hear from you soon!

Thanks and best regards,

Hans-Juergen