
Bug#678443: marked as done (Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38)



Your message dated Sat, 6 Jul 2013 18:00:34 +0200
with message-id <20130706160034.GG12523@pisco.westfalen.local>
and subject line Closing
has caused the Debian Bug report #678443,
regarding Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
678443: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=678443
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: linux-2.6
Version: 2.6.38-5

Hello!

(Background information: the motivation for this bug report comes from bug 639331, which already seems to have dug more deeply into the version introducing this behaviour)

The NMI watchdog mechanism (which I now believe to be the probable cause) has given me serious headaches since Debian kernel 2.6.38 was released. I cannot say so definitively yet, because the error is intermittent in my case and may take up to a week to appear, and I only disabled the NMI watchdog mechanism by adding "nowatchdog" yesterday (20120620), when I came across the bug report mentioned above.
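
Whether a boot parameter such as "nowatchdog" actually reached the running kernel can be checked against the kernel command line, which Linux exposes in /proc/cmdline. The following is only a minimal sketch in Python; the function name has_boot_param and the sample command lines are illustrative assumptions, not part of the original report:

```python
def has_boot_param(cmdline, name):
    """Return True if `name` appears as a token on the kernel command line.

    Matches both bare flags ("nowatchdog") and "name=value" forms.
    """
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


# Hypothetical command lines, for illustration only.
WITH_FLAG = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro nowatchdog"
WITHOUT_FLAG = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet"
```

On a live system the first argument would be the contents of /proc/cmdline, for example has_boot_param(open("/proc/cmdline").read(), "nowatchdog").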

A short summary of my problem:

- alongside several uniprocessor systems running Debian and Ubuntu, I am running several older multiprocessor servers (IBM Netfinity 5000 (Dual-P3 (Coppermine)), IBM Netfinity 7000 M10 (Quad-P3-Xeon (Tanner)) and IBM xSeries 232 (Dual-P3 (Tualatin))) with Debian (using testing as a "rolling release" after a long time with lenny)

- the systems were running rock-solid up to and including the Debian-packaged kernel 2.6.32 (all sub-versions).

- when the Debian-packaged kernel 2.6.38 came out, my problem started. It appeared mainly on the Netfinity 5000 (but, less often, on the other systems as well): after running continuously for one to eight days, the system suddenly locked up hard. In most cases it was just idle when this happened

- this lockup was a classic livelock, which can be diagnosed nicely on these IBM machines because they have an activity LED for each CPU: the LEDs glowed with identical brightness and without any modulation, so both CPUs were alternating in short cycles

- when comparing the basic system data and properties, I noticed a difference between kernels 2.6.32 and 2.6.38: the latter showed a continuously rising NMI count on each CPU, which was not the case with 2.6.32! Today I know where these NMIs are coming from: it is the watchdog mechanism that is also causing your laptop problem

- I hoped that the problem might disappear with kernel 3.4, as there were a few discussions on LKML about several livelocks/deadlocks related to timers and the like (the config change concerning the "lockup detector", which got enabled between 2.6.32 and 2.6.38, went unnoticed by me)

- as you see on the laptop, this lockup NEVER lets any message out via the debugging mechanisms, not even when a serial cable is attached and the console output is logged on a second machine

- now, using kernel 3.4.2, the problem still exists, but its consequences have changed a bit: instead of a livelock, it is a deadlock in most cases and activity stays on a single CPU, sometimes even causing a reboot instead of staying locked up

- I described the problem on a German forum, but nobody could point me to this lockup-detector change in the kernel config, even though I posted this significant change from "no NMIs" to "continuous NMIs". Here we see again how poorly the documentation of open-source projects is sometimes maintained... even when configuring a kernel, the config help says that the NMI watchdog has to be enabled deliberately by a boot parameter. In fact it seems to be activated by default as soon as SMP code is loaded and/or an APIC is detected (but despite the presence of an APIC, I have not seen those NMIs on my uniprocessor P3 machines yet).
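
The rising per-CPU NMI count mentioned in the summary above can be read from /proc/interrupts. The following is only a minimal sketch in Python, assuming the usual x86 layout in which a header row names the CPUs and the row labelled "NMI:" carries one counter per CPU. The function names and the sample text are illustrative assumptions and do not come from the affected machines:

```python
def parse_nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    # The header row lists the CPUs (CPU0, CPU1, ...).
    cpu_count = len(lines[0].split())
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # One counter per CPU follows the label; the rest is description.
            return [int(value) for value in fields[1:1 + cpu_count]]
    return []


def nmi_delta(before, after):
    """Per-CPU increase in NMI count between two samples."""
    return [later - earlier for earlier, later in zip(before, after)]


# Hypothetical samples, for illustration only.
SAMPLE_BEFORE = """\
           CPU0       CPU1
  0:        124          0   IO-APIC-edge      timer
NMI:       1500       1498   Non-maskable interrupts
LOC:     900000     899000   Local timer interrupts
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("1500", "1620").replace("1498", "1617")
```

Taking two samples of open("/proc/interrupts").read() some seconds apart and passing the parsed counters to nmi_delta shows whether the counters are climbing; an all-zero delta would correspond to the "no NMIs" behaviour seen with 2.6.32.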

Here is a link to my description on the German "debianforum": http://debianforum.de/forum/viewtopic.php?f=33&t=134210

I would like to report the bug to http://bugzilla.kernel.org if that has not yet been done by someone else, so it would be great if you could let me know whether you have already reported it.

Basically I think this mechanism has its bugs and/or makes wrong assumptions on some machines and should undergo a critical review. I'm wondering if there are more people in the world being troubled by strange lockups of their machines which are wrongly diagnosed as "hardware errors" etc.


Thanks and best regards,

Hans-Juergen



--- End Message ---
--- Begin Message ---
Hi,
your bug has been filed against the "linux-2.6" source package and was filed for
a kernel older than the recently released Debian 7.x / Wheezy with a severity
less than important.

We don't have the resources to reproduce the complete backlog of all older kernel
bugs, so we're closing this bug for now. If you can reproduce the bug with Debian Wheezy
or a more recent kernel from testing or unstable, please reopen the bug by sending
a mail to control@bugs.debian.org with the following three commands included in the
mail:

reopen BUGNUMBER
reassign BUGNUMBER src:linux
thanks
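
The three control commands above can be assembled into a mail programmatically. The following is only a minimal sketch in Python using the standard library email package; the function name build_reopen_message, the subject line and the sender address are illustrative assumptions, and the sketch only builds the message without sending it:

```python
from email.message import EmailMessage


def build_reopen_message(bug_number, sender):
    """Build (but do not send) a control mail that reopens and reassigns a bug."""
    body = "\n".join([
        f"reopen {bug_number}",
        f"reassign {bug_number} src:linux",
        "thanks",
    ]) + "\n"
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical sender address
    msg["To"] = "control@bugs.debian.org"
    msg["Subject"] = f"reopen {bug_number}"  # assumed subject, not prescribed above
    msg.set_content(body)
    return msg
```

For this report, build_reopen_message(678443, "user@example.org") yields a message whose body consists of the three command lines with BUGNUMBER replaced by 678443.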

Cheers,
        Moritz

--- End Message ---
