Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated

To: Ben Hutchings <ben@decadent.org.uk>
Cc: 584881@bugs.debian.org
Subject: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated
From: Ian Jackson <ijackson@chiark.greenend.org.uk>
Date: Thu, 24 Jun 2010 11:17:28 +0100
Message-id: <19491.12472.880707.704477@chiark.greenend.org.uk>
Reply-to: Ian Jackson <ijackson@chiark.greenend.org.uk>, 584881@bugs.debian.org
In-reply-to: <1277345735.26161.142.camel@localhost>
References: <19468.49549.475813.179092@chiark.greenend.org.uk> <1277075288.14011.1019.camel@localhost> <19487.15030.702626.287407@chiark.greenend.org.uk> <1277345735.26161.142.camel@localhost>

Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> Even if you can't get a process dump, you can get some useful
> information with:

Right, thanks.

> 'd' - show locks held
> 'l' - show backtrace for active CPUs
> 'w' - show uninterruptible tasks

I'll try these, although I suspect there will be thousands of
uninterruptible tasks.
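
For reference, the same three keys can be sent without a console
keyboard by writing the key letter to /proc/sysrq-trigger; the output
goes to the kernel log.  A minimal Python sketch, assuming root access
and a kernel built with magic SysRq support.  The helper name
send_sysrq and the KEYS table are hypothetical, for illustration only:

```python
from pathlib import Path

# Standard procfs interface for magic SysRq (needs root to write).
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")

# The three keys suggested above, with the descriptions given there.
KEYS = {
    "d": "show locks held",
    "l": "show backtrace for active CPUs",
    "w": "show uninterruptible tasks",
}


def send_sysrq(key, trigger=SYSRQ_TRIGGER):
    """Write one SysRq key letter to the trigger file.

    `trigger` is a parameter only so the helper can be pointed at an
    ordinary file for a dry run (an assumption of this sketch).
    """
    if key not in KEYS:
        raise ValueError(f"unsupported SysRq key: {key!r}")
    Path(trigger).write_text(key)


if __name__ == "__main__":
    for key, meaning in KEYS.items():
        print(f"sysrq {key}: {meaning}")
        send_sysrq(key)
```

The results then need to be read back from the kernel log (for example
with dmesg), which of course only helps if the machine is still
responsive enough to run a process.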

> > Searching the web suggests that symptoms very similar to mine are not
> > uncommon, including instances without soft lockup messages, and none
> > of the other users seem to have a similar disk layout.
> > 
> > I can't easily test this theory but I think the unusual disk layout is
> > probably simply making a race easier to trigger.
> 
> Thinking of some kind of lock-dependency bug?  Blocking on a mutex for a
> long period should still trigger a soft-lockup message.  Since there are
> no messages from the kernel it's something of a mystery what's going on.

The RAID system (md driver) has a separate mechanism for blocking
writes, which it calls a "barrier".  I think it is quite capable of
indefinitely blocking all writes to a device without necessarily
triggering the soft lockup detector.
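
To illustrate the idea (not the kernel's actual code): a toy Python
model of such a barrier.  The method names are loosely borrowed from
those I believe raid1.c uses; the class ToyBarrier, its fields and the
timeout parameter are assumptions of this sketch.  The point is that a
blocked writer sleeps on a condition rather than spinning on a CPU, so
a barrier that is never lowered stalls writers quietly:

```python
import threading


class ToyBarrier:
    """Toy model of a resync barrier that holds back normal writes."""

    def __init__(self):
        self._cond = threading.Condition()
        self.barrier = 0      # count of raised barriers (resync active)
        self.nr_pending = 0   # normal writes currently in flight

    def raise_barrier(self):
        # Resync side: block new writes, then wait for in-flight ones.
        with self._cond:
            self.barrier += 1
            self._cond.wait_for(lambda: self.nr_pending == 0)

    def lower_barrier(self):
        # Resync side: let waiting writers proceed again.
        with self._cond:
            self.barrier -= 1
            self._cond.notify_all()

    def wait_barrier(self, timeout=None):
        # Writer side: sleep until no barrier is raised.  Returns False
        # if the timeout expires first (the timeout exists only so the
        # toy can be exercised from a single thread).
        with self._cond:
            ok = self._cond.wait_for(lambda: self.barrier == 0, timeout)
            if ok:
                self.nr_pending += 1
            return ok

    def allow_barrier(self):
        # Writer side: this write has completed.
        with self._cond:
            self.nr_pending -= 1
            self._cond.notify_all()


if __name__ == "__main__":
    b = ToyBarrier()
    b.raise_barrier()                    # resync starts
    print(b.wait_barrier(timeout=0.1))   # writer stays blocked
    b.lower_barrier()                    # resync finishes
    print(b.wait_barrier(timeout=0.1))   # writer gets through
```

With no timeout, a writer calling wait_barrier() while the barrier
stays raised would simply sleep forever, which is the behaviour I
suspect here.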

> > I'll see if I can borrow a spare R210 from Jump, in which case I may
> > be able to reproduce the problem in controlled conditions on my coffee
> > table at home (and with access to the VGA console).  Which kernel
> > should I test in that case?
> 
> Please try 2.6.34 from experimental.

Will do.  I'll get back to you.

Thanks,
Ian.
