
Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0



Summary for the SPARC maintainers:

The NMI watchdog is firing on Sun Fire 280R and Sun Blade 2500 systems
with one or both processors in cheetah_xcall_deliver().  This has been
seen under Linux 3.0, 3.2 and 3.3 and seems to be associated with disk I/O.

Full bug log is at: http://bugs.debian.org/648766

On Tue, 2012-04-03 at 20:56 -0400, Kieron Gillespie wrote:
> I have also noticed, that if I am reading the trace correctly that in 
> both of my cases, and the original bug submitter's, and a bug posted on 
> old.nabble.com's case the crash always seems to happen when one CPU is 
> doing cheetah_xcall_deliver, and the other CPU is in the same 
> instruction in tl0_irq15. Here is a link to the post.
[...]

tl0_irq15 seems to be part of the NMI watchdog (for detecting that the
kernel has hung), so you should always see that in a backtrace when the
NMI watchdog fires.  It's not part of the problem.

cheetah_xcall_deliver() does appear to be relevant to the problem, and it
looks like it could loop indefinitely, though presumably only if a
processor is behaving strangely.  It appears to periodically enable and
disable interrupts, but I'm not sure how the PSTATE.IE and PIL
interrupt control fields interact, and I don't think this will reset the
NMI watchdog.  In any case, it seems like there's a serious problem if
it's looping for a long time, whether or not interrupts remain disabled.
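
To make the concern concrete, here is a minimal userspace sketch of a
cross-call retry loop with an explicit retry bound.  It is not the kernel's
cheetah_xcall_deliver() code: struct fake_target, try_deliver() and
deliver_bounded() are hypothetical names invented for illustration, and the
"target" is just a counter of how many times it refuses delivery.  It only
shows how a loop that retries on every refusal spins for as long as the
target keeps refusing, and how a bound turns that into a reportable failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a target CPU; not a kernel structure. */
struct fake_target {
    unsigned int nacks_remaining; /* refusals left before it accepts */
};

/* One delivery attempt: true if accepted, false if refused (NACKed). */
bool try_deliver(struct fake_target *t)
{
    if (t->nacks_remaining > 0) {
        t->nacks_remaining--;
        return false;
    }
    return true;
}

/*
 * Retry delivery at most max_retries times.
 * Returns the number of attempts used on success, or -1 after giving up.
 */
long deliver_bounded(struct fake_target *t, long max_retries)
{
    assert(max_retries > 0);

    for (long attempt = 1; attempt <= max_retries; attempt++) {
        /* Real code would mask interrupts around the dispatch and
         * unmask them before backing off; that is omitted here. */
        if (try_deliver(t))
            return attempt;
    }
    return -1; /* caller can report the stuck target instead of spinning */
}
```

With a target that refuses three times, deliver_bounded() succeeds on the
fourth attempt.  With a target that refuses more often than the bound
allows, it returns -1 rather than looping until a watchdog notices.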

Ben.

-- 
Ben Hutchings
Larkinson's Law: All laws are basically false.


