On Tue, 2010-07-20 at 22:51 +0100, Ian Jackson wrote:
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > Please try 2.6.34 from experimental.
>
> I've now replicated the problem on my coffee table with a temporary
> (intermittent) loan of an identical machine from Jump networks.
> I can get 2.6.26-21lenny4 and 2.6.26-24 to crash on demand very
> easily. In none of the crashes do I get any kind of kernel log
> messages (eg, soft lockup warnings).
[...]
But I do see a deadlock on the barrier between resync and normal I/O, as
you suspected. Every process is blocked in wait_barrier(),
raise_barrier() or a filesystem lock.
The only explanations I can think of are:

1. The regular I/O requests were miscounted so that the resync process
   will wait forever for them to complete in raise_barrier(), while the
   other processes wait forever for the barrier to be lowered.  (It seems
   like the directions of this barrier are inverted.  Maybe it is really a
   bollard.)
2. The wait condition of one of the waiters was met, but it wasn't
   woken.
3. A thread which raised the barrier failed to lower the barrier.
4. A thread which raised the barrier caused paging.

Unfortunately I can't see any evidence for any of these.
The following patch should give us some chance of detecting cases 1 and
3. The instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official> explain how to rebuild an official kernel package with a patch applied.
Ben.
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -657,6 +657,7 @@
 
 	/* block any new IO from starting */
 	conf->barrier++;
+	WARN_ON(conf->barrier != 1);
 
 	/* No wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
@@ -671,6 +672,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->barrier == 0);
 	conf->barrier--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
@@ -694,6 +696,7 @@
 {
 	unsigned long flags;
 	spin_lock_irqsave(&conf->resync_lock, flags);
+	WARN_ON(conf->nr_pending == 0);
 	conf->nr_pending--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
--- END ----
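
To make it easier to see what the three WARN_ON() checks in the patch
are guarding, here is a rough userspace model in C of the two counters
involved.  It is only a sketch under stated assumptions: the field names
barrier and nr_pending come from the patch, but struct barrier_model,
the model_* helpers and the warnings counter are hypothetical, there is
no locking or sleeping, and the wait conditions are reduced to
"barrier == 0" for regular I/O and "nr_pending == 0" for resync, which
is simpler than what raid1.c really tests.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; only the names 'barrier' and
 * 'nr_pending' are taken from the patch. */
struct barrier_model {
	int barrier;     /* resync barriers currently raised */
	int nr_pending;  /* regular I/O requests in flight */
	int warnings;    /* number of invariant checks that fired */
};

/* Stand-in for WARN_ON(): record and report, but keep going. */
void model_warn_on(struct barrier_model *m, bool cond, const char *what)
{
	if (cond) {
		m->warnings++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
}

/* Simplified wait conditions (assumptions, see the text above). */
bool resync_may_proceed(const struct barrier_model *m)
{
	return m->nr_pending == 0;
}

bool io_may_start(const struct barrier_model *m)
{
	return m->barrier == 0;
}

void model_raise_barrier(struct barrier_model *m)
{
	m->barrier++;
	/* First check in the patch: a second raise suggests case 3. */
	model_warn_on(m, m->barrier != 1, "barrier raised while already raised");
}

void model_lower_barrier(struct barrier_model *m)
{
	/* Second check in the patch: lowering with nothing raised. */
	model_warn_on(m, m->barrier == 0, "barrier lowered while not raised");
	m->barrier--;
}

void model_start_io(struct barrier_model *m)
{
	m->nr_pending++;
}

void model_end_io(struct barrier_model *m)
{
	/* Third check in the patch: an uncounted completion, case 1. */
	model_warn_on(m, m->nr_pending == 0, "I/O completed but never counted");
	m->nr_pending--;
}

/* Both sides wait on each other: resync for pending I/O to drain,
 * regular I/O for the barrier to be lowered. */
bool model_is_stuck(const struct barrier_model *m)
{
	return !resync_may_proceed(m) && !io_may_start(m);
}
```

For example, calling model_start_io() once and model_end_io() twice
leaves nr_pending at -1 and trips the third check; a model_raise_barrier()
after that leaves model_is_stuck() true, which is the shape of
explanation 1.  Calling model_raise_barrier() twice without lowering in
between trips the first check, as in explanation 3.  Explanations 2 and
4 are not covered by this model, just as they are not covered by the
patch.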
Ben.
--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.