Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated

To: Ian Jackson <ijackson@chiark.greenend.org.uk>, 584881@bugs.debian.org
Subject: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated
From: Ben Hutchings <ben@decadent.org.uk>
Date: Thu, 24 Jun 2010 03:15:35 +0100
Message-id: <1277345735.26161.142.camel@localhost>
Reply-to: Ben Hutchings <ben@decadent.org.uk>, 584881@bugs.debian.org
In-reply-to: <19487.15030.702626.287407@chiark.greenend.org.uk>
References: <19468.49549.475813.179092@chiark.greenend.org.uk> <1277075288.14011.1019.camel@localhost> <19487.15030.702626.287407@chiark.greenend.org.uk>

On Mon, 2010-06-21 at 11:11 +0100, Ian Jackson wrote:
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > We really need to see the kernel messages reporting soft-lockup.
> 
> There aren't any.  Or, if there are, it isn't printing them to the
> serial console.  Perhaps it is trying to send them only to syslogd
> which obviously can't write to the disk either.
> 
> I know that the serial console is working after the crash as it echoes
> my first carriage return (although then login or the shell wedges
> before being able to print a prompt).  I could have asked for a magic
> sysrq process dump but given that the process table is probably full I
> expect this would take many hours at 9600bps.

Even if you can't get a full process dump, you can still get some useful
information from these magic SysRq commands:
'd' - show locks held
'l' - show backtrace for active CPUs
'w' - show uninterruptible tasks
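
If a root shell is still usable (which may not be the case on the wedged
system described here), the same three commands can be issued by writing
the key to /proc/sysrq-trigger. The following is only a minimal sketch:
the helper name trigger_sysrq and its trigger_path parameter are made up
for illustration, and it assumes SysRq is enabled in the running kernel.
As far as I know, 'd' only produces output on kernels built with lock
debugging. The output goes to the kernel log and console, not to the
caller. On a serial console the usual alternative is to send a BREAK
followed by the key.

```python
# Keys and descriptions exactly as listed in the message above.
SYSRQ_KEYS = {
    "d": "show locks held",
    "l": "show backtrace for active CPUs",
    "w": "show uninterruptible tasks",
}


def trigger_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command key to the trigger file.

    trigger_path is a parameter (an assumption of this sketch) so the
    function can be pointed at an ordinary file for a dry run.
    Writing to the real file requires root.
    """
    if key not in SYSRQ_KEYS:
        raise ValueError("unsupported SysRq key: %r" % (key,))
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


# Example use, as root, on the affected machine:
#   for key in "dlw":
#       trigger_sysrq(key)
```

Restricting the helper to the three listed keys is deliberate: other
SysRq keys can reboot or power off the machine.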

> > Note that your disk configuration is unusual and probably not
> > well-tested by others.  It is unlikely that anyone in the kernel team
> > will be able to debug this as I don't think any of us have particular
> > expertise in this area.
> 
> Searching the web suggests that symptoms very similar to mine are not
> uncommon, including instances without soft lockup messages, and none
> of the other users seem to have a similar disk layout.
> 
> I can't easily test this theory but I think the unusual disk layout is
> probably simply making a race easier to trigger.

Are you thinking of some kind of lock-dependency bug?  Blocking on a
mutex for a long period should still trigger a soft-lockup message.
Since there are no messages from the kernel, it's something of a mystery
what's going on.

> >  If you have the opportunity, it would be
> > helpful if you could test a newer kernel version.  This would give us a
> > clue as to whether the bug has subsequently been fixed upstream, and if
> > not then it would be the basis for an upstream bug report.
> 
> Unfortunately testing this bug on the live system involves (a) risking
> a crash and then if the test fails (b) an extremely vulnerable system
> without backups for the following four days.

That's what I suspected.

> I'll see if I can borrow a spare R210 from Jump, in which case I may
> be able to reproduce the problem in controlled conditions on my coffee
> table at home (and with access to the VGA console).  Which kernel
> should I test in that case ?

Please try 2.6.34 from experimental.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
