Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated

To: Ian Jackson <ijackson@chiark.greenend.org.uk>, 584881@bugs.debian.org
Subject: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated
From: Ben Hutchings <ben@decadent.org.uk>
Date: Thu, 22 Jul 2010 13:33:41 +0100
Message-id: <1279802021.4883.356.camel@localhost>
Reply-to: Ben Hutchings <ben@decadent.org.uk>, 584881@bugs.debian.org
In-reply-to: <19526.6725.662000.364909@chiark.greenend.org.uk>
References: <19468.49549.475813.179092@chiark.greenend.org.uk> <1277075288.14011.1019.camel@localhost> <19487.15030.702626.287407@chiark.greenend.org.uk> <1277345735.26161.142.camel@localhost> <19526.6725.662000.364909@chiark.greenend.org.uk>

On Tue, 2010-07-20 at 22:51 +0100, Ian Jackson wrote: 
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > Please try 2.6.34 from experimental.
> 
> I've now replicated the problem on my coffee table with a temporary
> (intermittent) loan of an identical machine from Jump networks.
> I can get 2.6.26-21lenny4 and 2.6.26-24 to crash on demand very
> easily.  In none of the crashes do I get any kind of kernel log
> messages (eg, soft lockup warnings).
[...]

But I do see a deadlock on the barrier between resync and normal I/O, as
you suspected.  Every process is blocked in wait_barrier(),
raise_barrier() or a filesystem lock.

The only explanations I can think of are:
1. The regular I/O requests were miscounted so that the resync process
will wait forever for them to complete in raise_barrier(), while the
other processes wait forever for the barrier to be lowered.  (It seems
like the directions of this barrier are inverted.  Maybe it is really a
bollard.)
2. The wait condition of one of the waiters was met, but it wasn't
woken.
3. A thread which raised the barrier failed to lower the barrier.
4. A thread which raised the barrier caused paging.

Unfortunately I can't see any evidence for any of these.

The following patch should give us some chance of detecting cases 1 and
3.  The instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>
explain how to rebuild an official kernel package with a patch applied.

Ben.

--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -657,6 +657,7 @@

 	/* block any new IO from starting */
 	conf->barrier++;
+	WARN_ON(conf->barrier != 1);

 	/* No wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
@@ -671,6 +672,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->barrier == 0);
 	conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
@@ -694,6 +696,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->nr_pending == 0);
 	conf->nr_pending--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
--- END ----
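
To illustrate why checking the counters before each decrement can expose
case 1, here is a rough userspace sketch.  It is not kernel code: the
struct model_conf type, the model_* helpers and both_sides_blocked() are
hypothetical names made up for this example, locking and wakeups are left
out, and the RESYNC_DEPTH limit on the barrier is ignored.  It only models
the two wait conditions and shows that a completion counted twice leaves
both sides waiting on a counter that nothing will ever correct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two raid1 counters; not kernel code. */
struct model_conf {
	int barrier;	/* raised by resync, lowered when resync I/O ends */
	int nr_pending;	/* regular I/O requests counted as in flight */
};

/* Wake condition for regular I/O sleeping in wait_barrier(). */
static bool regular_io_may_proceed(const struct model_conf *c)
{
	return c->barrier == 0;
}

/* Wake condition for resync sleeping in raise_barrier()
 * (assumption: the RESYNC_DEPTH limit is omitted here). */
static bool resync_may_proceed(const struct model_conf *c)
{
	return c->nr_pending == 0;
}

/* True when neither side's wake condition holds.  This is only a real
 * deadlock if no request is genuinely outstanding to fix nr_pending. */
static bool both_sides_blocked(const struct model_conf *c)
{
	return !regular_io_may_proceed(c) && !resync_may_proceed(c);
}

/* Regular I/O start: count the request. */
static void model_wait_barrier(struct model_conf *c)
{
	c->nr_pending++;
}

/* Regular I/O completion; returns true where the WARN_ON would fire. */
static bool model_allow_barrier(struct model_conf *c)
{
	bool warn = (c->nr_pending == 0);

	c->nr_pending--;
	return warn;
}

/* Resync start: block new regular I/O. */
static void model_raise_barrier(struct model_conf *c)
{
	c->barrier++;
}

/* Resync completion; returns true where the WARN_ON would fire. */
static bool model_lower_barrier(struct model_conf *c)
{
	bool warn = (c->barrier == 0);

	c->barrier--;
	return warn;
}

/* Walk through a balanced sequence, then a double-counted completion. */
static void model_selfcheck(void)
{
	struct model_conf c = { 0, 0 };
	bool warned;

	/* Balanced: one request, one resync step, no warnings. */
	model_wait_barrier(&c);			/* nr_pending = 1 */
	model_raise_barrier(&c);		/* barrier = 1 */
	warned = model_allow_barrier(&c);	/* nr_pending = 0 */
	assert(!warned && resync_may_proceed(&c));
	warned = model_lower_barrier(&c);	/* barrier = 0 */
	assert(!warned && regular_io_may_proceed(&c));

	/* Miscount: the same completion is counted twice. */
	model_wait_barrier(&c);			/* nr_pending = 1 */
	warned = model_allow_barrier(&c);	/* nr_pending = 0 */
	assert(!warned);
	warned = model_allow_barrier(&c);	/* nr_pending = -1 */
	assert(warned);				/* the check catches it here */

	/* No request is outstanding, yet resync will wait forever. */
	model_raise_barrier(&c);		/* barrier = 1 */
	assert(both_sides_blocked(&c));
}
```

Calling model_selfcheck() from a small main() runs both sequences.  Note
that the sketch only covers an under-count; a request counted twice on
the way in would hang in the same way without tripping either check.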

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

