Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated



On Tue, 2010-07-20 at 22:51 +0100, Ian Jackson wrote: 
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > Please try 2.6.34 from experimental.
> 
> I've now replicated the problem on my coffee table with a temporary
> (intermittent) loan of an identical machine from Jump networks.
> I can get 2.6.26-21lenny4 and 2.6.26-24 to crash on demand very
> easily.  In none of the crashes do I get any kind of kernel log
> messages (eg, soft lockup warnings).
[...]

But I do see a deadlock on the barrier between resync and normal I/O, as
you suspected.  Every process is blocked in wait_barrier(),
raise_barrier() or a filesystem lock.

The only explanations I can think of are:
1. The regular I/O requests were miscounted so that the resync process
will wait forever for them to complete in raise_barrier(), while the
other processes wait forever for the barrier to be lowered.  (It seems
like the directions of this barrier are inverted.  Maybe it is really a
bollard.)
2. The wait condition of one of the waiters was met, but it wasn't
woken.
3. A thread which raised the barrier failed to lower the barrier.
4. A thread which raised the barrier caused paging.

Unfortunately I can't see any evidence for any of these.
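
To make case 1 concrete, here is a minimal userspace sketch of the two
counters. It is not kernel code: the struct, the helper names and the
in_flight "ground truth" field are all made up for illustration, and the
RESYNC_DEPTH limit and all locking are left out. Only the names barrier
and nr_pending mirror the fields touched by the patch below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; only barrier and nr_pending mirror conf fields. */
struct barrier_model {
	int barrier;     /* raised by resync; blocks new regular I/O */
	int nr_pending;  /* regular I/O the driver believes is in flight */
	int in_flight;   /* ground truth: regular I/O really in flight */
};

/* Regular I/O path: refuse to start while the barrier is raised.
 * (The real code would sleep in wait_barrier() instead of returning.) */
static bool io_start(struct barrier_model *m)
{
	if (m->barrier > 0)
		return false;
	m->nr_pending++;
	m->in_flight++;
	return true;
}

/* Completion path. With miscount set, the request finishes but the
 * counter is not decremented, which is the bug assumed in case 1. */
static void io_end(struct barrier_model *m, bool miscount)
{
	assert(m->in_flight > 0);
	m->in_flight--;
	if (!miscount)
		m->nr_pending--;
}

/* Resync path: raise the barrier to keep new regular I/O out. */
static void barrier_raise(struct barrier_model *m)
{
	m->barrier++;
}

/* Resync may go ahead once the barrier is up and nothing is pending. */
static bool resync_may_proceed(const struct barrier_model *m)
{
	return m->barrier > 0 && m->nr_pending == 0;
}

/* Barrier is up, the counter says I/O is pending, but nothing is left
 * that could ever complete and decrement it: nobody makes progress. */
static bool stuck(const struct barrier_model *m)
{
	return m->barrier > 0 && m->nr_pending > 0 && m->in_flight == 0;
}
```

In this model a single miscounted completion followed by barrier_raise()
leaves resync_may_proceed() false and io_start() refusing new requests,
which matches the symptom of everything sitting in wait_barrier() or
raise_barrier(). Whether that is what actually happens on the affected
machines is exactly what remains unproven.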

The following patch should give us some chance of detecting cases 1 and
3.  The instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official> explain how to rebuild an official kernel package with a patch applied.

Ben.

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -657,6 +657,7 @@
 
 	/* block any new IO from starting */
 	conf->barrier++;
+	WARN_ON(conf->barrier != 1);
 
 	/* No wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
@@ -671,6 +672,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->barrier == 0);
 	conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
@@ -694,6 +696,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->nr_pending == 0);
 	conf->nr_pending--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
--- END ----

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
