
Re: Bug#542250: repeatable crashes while copying 500G from NFS mount to local logical volume



> On Wed, 2009-08-19 at 22:36 +0400, Nikita V. Youshchenko wrote:
> > tags 542250 +patch
> > thanks
> >
> > > ... I may guess that line 74 should check for in_interrupt() instead
> > > of in_softirq().
> >
> > I've tried that and it really fixed the problem. Server already runs
> > the same backup procedure for several hours. Previously it crashed
> > within 15 minutes.
> >
> > Here is the patch I've applied:
> >
> > --- a/drivers/xen/core/spinlock.c       2009-08-19 16:20:17.000000000 +0400
> > +++ b/drivers/xen/core/spinlock.c       2009-08-19 17:36:55.000000000 +0400
> > @@ -71,7 +71,7 @@
> >                         BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
> >                         spinning = &__get_cpu_var(spinning_irq);
> >                 } else {
> > -                       BUG_ON(!in_softirq());
> > +                       BUG_ON(!in_interrupt());
> >                         spinning = &__get_cpu_var(spinning_bh);
> >                 }
> >                 BUG_ON(spinning->lock);
>
> I'm glad it works for you, but it isn't a proper fix.

Could you please explain? How could that code line be hit if not in an
interrupt handler?

Here is my understanding of the logic of that code. It tries to track the
spinlocks the CPU is currently spinning on. A spinning CPU may be
interrupted only by an irq. Two "normal" (not SA_NODELAY) interrupt
handlers can't be active on the same CPU at the same time. That leads to a
maximum of 3 nested spins:
- one from process context,
- one from "normal" irq handler that interrupted that process context,
- and one from SA_NODELAY irq handler that interrupted normal irq handler. 
This one can't be interrupted since it runs with interrupts disabled.

If so, the code path in question corresponds to a "normal" interrupt
handler starting to spin. Thus the check should be in_interrupt().
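
To illustrate the nesting argument above, here is a minimal userspace sketch
of one tracking slot per context. It is not the code from
drivers/xen/core/spinlock.c; every name in it (spin_ctx, cpu_spin_state,
spin_begin, spin_end, spin_depth) is hypothetical and only models my
understanding that at most three spins can nest on one CPU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not the Xen implementation. */
enum spin_ctx {
    CTX_PROCESS,      /* spin started from process context */
    CTX_NORMAL_IRQ,   /* spin started from a "normal" irq handler */
    CTX_NODELAY_IRQ,  /* spin started from an SA_NODELAY irq handler */
    CTX_COUNT
};

struct cpu_spin_state {
    const void *slot[CTX_COUNT];  /* lock spun on in each context, or NULL */
};

/*
 * Record that this CPU starts spinning on `lock` in context `ctx`.
 * Returns false if the slot is already occupied, which is the situation
 * where the real code would hit BUG_ON(spinning->lock).
 */
static inline bool spin_begin(struct cpu_spin_state *cpu, enum spin_ctx ctx,
                              const void *lock)
{
    if (cpu->slot[ctx] != NULL)
        return false;
    cpu->slot[ctx] = lock;
    return true;
}

/* Record that the spin in context `ctx` has finished. */
static inline void spin_end(struct cpu_spin_state *cpu, enum spin_ctx ctx)
{
    cpu->slot[ctx] = NULL;
}

/* Number of contexts currently spinning on this CPU (0..3). */
static inline size_t spin_depth(const struct cpu_spin_state *cpu)
{
    size_t depth = 0;
    for (int i = 0; i < CTX_COUNT; i++)
        if (cpu->slot[i] != NULL)
            depth++;
    return depth;
}
```

In this model a second spin from the same context is the only thing that can
fail, which matches the assumption that two "normal" handlers never nest on
one CPU.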

How is this wrong?

Perhaps a softirq handler could be activated at the exit of the "normal"
handler? Maybe a better check is BUG_ON(!in_interrupt() && !in_softirq()).
Need to check the code ...
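
While checking, a small userspace model of how these predicates relate may
help. It assumes the preempt_count bit layout from include/linux/hardirq.h
in 2.6-era kernels (8 preempt bits, 8 softirq bits, 12 hardirq bits); the
exact widths depend on kernel version and config. The model_* and check_*
names are made up for this sketch and take the count as an argument instead
of reading per-CPU state.

```c
#include <stdbool.h>

/* Assumed bit layout; widths may differ in the kernel actually running. */
#define PREEMPT_BITS   8
#define SOFTIRQ_BITS   8
#define HARDIRQ_BITS   12

#define SOFTIRQ_SHIFT  PREEMPT_BITS
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)

#define SOFTIRQ_MASK   (((1UL << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK   (((1UL << HARDIRQ_BITS) - 1) << HARDIRQ_SHIFT)

#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT)  /* one softirq level */
#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT)  /* one hardirq level */

/* Stand-in for in_softirq(): any softirq bits set. */
static inline bool model_in_softirq(unsigned long count)
{
    return (count & SOFTIRQ_MASK) != 0;
}

/* Stand-in for in_interrupt(): any hardirq or softirq bits set. */
static inline bool model_in_interrupt(unsigned long count)
{
    return (count & (HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* Original check: BUG_ON(!in_softirq()) */
static inline bool check_original_fires(unsigned long count)
{
    return !model_in_softirq(count);
}

/* Patched check: BUG_ON(!in_interrupt()) */
static inline bool check_patched_fires(unsigned long count)
{
    return !model_in_interrupt(count);
}

/* Combined check: BUG_ON(!in_interrupt() && !in_softirq()) */
static inline bool check_combined_fires(unsigned long count)
{
    return !model_in_interrupt(count) && !model_in_softirq(count);
}
```

Under these assumptions, a count of HARDIRQ_OFFSET alone (hard irq, no
softirq) trips the original check but not the patched one, and the combined
check never differs from the patched one, because in_interrupt() already
covers softirq context.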

Nikita
