On Tue, 2010-07-20 at 22:51 +0100, Ian Jackson wrote:
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > Please try 2.6.34 from experimental.
>
> I've now replicated the problem on my coffee table with a temporary
> (intermittent) loan of an identical machine from Jump networks.
> I can get 2.6.26-21lenny4 and 2.6.26-24 to crash on demand very
> easily.  In none of the crashes do I get any kind of kernel log
> messages (eg, soft lockup warnings).
[...]

But I do see a deadlock on the barrier between resync and normal I/O, as
you suspected.  Every process is blocked in wait_barrier(),
raise_barrier() or a filesystem lock.

The only explanations I can think of are:

1. The regular I/O requests were miscounted so that the resync process
   will wait forever for them to complete in raise_barrier(), while the
   other processes wait forever for the barrier to be lowered.  (It
   seems like the directions of this barrier are inverted.  Maybe it is
   really a bollard.)
2. The wait condition of one of the waiters was met, but it wasn't
   woken.
3. A thread which raised the barrier failed to lower the barrier.
4. A thread which raised the barrier caused paging.

Unfortunately I can't see any evidence for any of these.

The following patch should give us some chance of detecting cases 1 and
3.  The instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>
explain how to rebuild an official kernel package with a patch applied.

Ben.

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -657,6 +657,7 @@
 
 	/* block any new IO from starting */
 	conf->barrier++;
+	WARN_ON(conf->barrier != 1);
 
 	/* No wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
@@ -671,6 +672,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->barrier == 0);
 	conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
@@ -694,6 +696,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->nr_pending == 0);
 	conf->nr_pending--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
--- END ----

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
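
As a rough illustration of what the three added checks can catch, here
is a minimal userspace sketch of the counter bookkeeping only.  It is
not the raid1 code: struct barrier_model, model_warn_on() and the
model_* function names are hypothetical, locking and sleeping are left
out entirely, and the mapping of each check to "raise", "lower" and
"I/O end" is an assumption read off the patch context.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the two counters the patch inspects. */
struct barrier_model {
    int barrier;     /* models conf->barrier */
    int nr_pending;  /* models conf->nr_pending */
    int warnings;    /* number of checks that have fired */
};

/* Rough analogue of WARN_ON(): report and count, then carry on. */
static int model_warn_on(struct barrier_model *m, int cond, const char *what)
{
    if (cond) {
        m->warnings++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/* Resync side raises the barrier (check from the first hunk). */
static void model_raise(struct barrier_model *m)
{
    assert(m != NULL);
    m->barrier++;
    model_warn_on(m, m->barrier != 1, "barrier != 1 after raise");
}

/* Resync side lowers the barrier (check from the second hunk). */
static void model_lower(struct barrier_model *m)
{
    assert(m != NULL);
    model_warn_on(m, m->barrier == 0, "lowering a barrier that is not raised");
    m->barrier--;
}

/* A regular I/O request starts; waiting on the barrier is not modelled. */
static void model_io_start(struct barrier_model *m)
{
    assert(m != NULL);
    m->nr_pending++;
}

/* A regular I/O request completes (check from the third hunk). */
static void model_io_end(struct barrier_model *m)
{
    assert(m != NULL);
    model_warn_on(m, m->nr_pending == 0, "completing I/O that was never counted");
    m->nr_pending--;
}
```

In this model a balanced sequence (I/O start, raise, lower, I/O end)
fires no check.  Raising twice without lowering trips the first check,
which is the shape of case 3, and completing an I/O that was never
counted trips the third, which is one way case 1 could show up.  Cases
2 and 4 involve wakeups and paging, which the sketch does not model.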