
Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0



On Sat, 2012-04-07 at 13:40 -0400, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sat, 07 Apr 2012 18:21:38 +0100
> 
> > cheetah_xcall_deliver() does appear to be relevant to the problem and it
> > looks like it could loop indefinitely - though presumably only if a
> > processor is behaving strangely?
> 
> It can only loop indefinitely if one of the cpus is hung and
> does not respond to the cross-call interrupt.

Well, it has to keep responding with a NACK, right?

Will the recipient NACK if the cross-call interrupt is disabled, or do
the processors have a buffer/FIFO for such IRQs?

If there's no buffer then what stops two processors live-locking here?
(The 'random time' where some interrupts are enabled isn't really very
random, so it seems to be possible for two processors to go round the
loop in lock-step.)
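
To make the lock-step concern concrete, here is a toy model in C. It is a
sketch only and rests on assumptions the thread has not confirmed: there
is no buffering, a recipient NACKs whenever its interrupts are disabled,
and both CPUs run a retry loop with the same fixed period. All names
(PERIOD, ENABLED_PHASE, steps_until_delivered) are hypothetical and do
not come from the kernel source.

```c
#include <stdbool.h>

#define PERIOD        4  /* steps per retry-loop iteration (assumed) */
#define ENABLED_PHASE 3  /* the single step with interrupts enabled (assumed) */

/*
 * CPU A dispatches at phase 0 of its loop; CPU B runs the same loop,
 * shifted by offset_b steps, and accepts only during ENABLED_PHASE.
 * Returns the first step at which A's cross-call is accepted, or -1
 * if it is NACKed on every attempt within max_steps.
 */
static int steps_until_delivered(int offset_b, int max_steps)
{
    for (int t = 0; t < max_steps; t++) {
        int phase_a = t % PERIOD;
        int phase_b = (t + offset_b) % PERIOD;
        bool a_sends = (phase_a == 0);
        bool b_accepts = (phase_b == ENABLED_PHASE);

        if (a_sends && b_accepts)
            return t;
    }
    return -1;
}
```

Under these assumptions, steps_until_delivered(0, 1000) never succeeds
because both CPUs are in the same phase on every attempt, whereas an
offset that lines up B's enabled step with A's dispatch succeeds at once.
Whether real hardware behaves this way is exactly the open question.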

> > It appears to periodically enable and disable interrupts, but then
> > I'm not sure how the PSTATE.IE and PIL interrupt control fields
> > interact and I don't think this will reset the NMI watchdog.
> 
> PSTATE.IE controls delivery of all interrupts, both PIL
> based and vectored interrupts.
>
> PIL only controls delivery of PIL interrupts.
> 
> The NMI watchdog interrupt is a special PIL interrupt, and
> most of the standard local_irq_disable() et al. routines on
> sparc will adjust the %pil such that NMI watchdog interrupts
> are still delivered.
> 
> See include/asm/pil.h for details.
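
The gating rules described above can be written down as a small C model.
This is a sketch, not kernel code: struct cpu_state and both helper
functions are hypothetical, and the PIL_NORMAL_MAX and PIL_NMI values
are quoted from memory of pil.h, so treat them as assumptions. The
comparison "level > pil" follows the SPARC V9 rule that a PIL interrupt
is taken only when its level exceeds the current %pil.

```c
#include <stdbool.h>

#define PIL_NORMAL_MAX 14  /* assumed: %pil used when "irqs disabled" */
#define PIL_NMI        15  /* assumed: level of the NMI watchdog interrupt */

struct cpu_state {
    bool pstate_ie;  /* PSTATE.IE: master enable for all interrupts */
    int  pil;        /* current processor interrupt level (%pil) */
};

/* A PIL interrupt needs PSTATE.IE set and a level above %pil. */
static bool pil_irq_deliverable(const struct cpu_state *c, int level)
{
    return c->pstate_ie && level > c->pil;
}

/* A vectored interrupt (e.g. a cross-call) is gated by PSTATE.IE only. */
static bool vectored_irq_deliverable(const struct cpu_state *c)
{
    return c->pstate_ie;
}
```

With pil set to PIL_NORMAL_MAX and PSTATE.IE set, the model blocks
levels 1 to 14 but still delivers PIL_NMI and vectored interrupts.
Clearing PSTATE.IE blocks everything, the watchdog included.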

Obviously the NMI watchdog is not being disabled, but I was wondering
how its timer gets reset.

Having RTFS, it appears that it is never really reset but is held off by
either an hrtimer interrupt or an explicit call to watchdog_nmi_touch()
during each interval.  If I'm not mistaken, the hrtimer interrupt is
being disabled by xcall_deliver() and remains disabled.
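
As I read it, the hold-off logic amounts to something like the following
C model. It is a sketch under stated assumptions, not the code in the
tree: struct wd_state, wd_nmi(), wd_touch() and MISSED_LIMIT are all
hypothetical names, and the limit of 5 is arbitrary.

```c
#include <stdbool.h>

#define MISSED_LIMIT 5  /* hypothetical: stalled NMI intervals before reporting */

struct wd_state {
    unsigned long last_tick_count;  /* timer tick count seen at the last NMI */
    bool          touched;          /* set by an explicit touch */
    int           missed;           /* consecutive intervals with no progress */
};

/* Explicit hold-off: the next NMI will treat the CPU as alive. */
static void wd_touch(struct wd_state *w)
{
    w->touched = true;
}

/*
 * Called on every simulated watchdog NMI with the current timer tick
 * count. Returns true once the CPU has shown no progress for
 * MISSED_LIMIT consecutive intervals.
 */
static bool wd_nmi(struct wd_state *w, unsigned long tick_count)
{
    if (w->touched || tick_count != w->last_tick_count) {
        w->touched = false;
        w->last_tick_count = tick_count;
        w->missed = 0;
        return false;
    }
    return ++w->missed >= MISSED_LIMIT;
}
```

In this model nothing ever resets a timer. The NMI keeps firing, and only
an advancing tick count or a touch stops the missed counter from reaching
the limit. If the timer interrupt stays masked, the tick count freezes
and the lockup report follows after MISSED_LIMIT intervals.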

Ben.

-- 
Ben Hutchings
If more than one person is responsible for a bug, no one is at fault.
