Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated

To: Ben Hutchings <ben@decadent.org.uk>
Cc: 584881@bugs.debian.org
Subject: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated
From: Ian Jackson <ijackson@chiark.greenend.org.uk>
Date: Mon, 21 Jun 2010 11:11:02 +0100
Message-id: <19487.15030.702626.287407@chiark.greenend.org.uk>
Reply-to: Ian Jackson <ijackson@chiark.greenend.org.uk>, 584881@bugs.debian.org
In-reply-to: <1277075288.14011.1019.camel@localhost>
References: <19468.49549.475813.179092@chiark.greenend.org.uk> <1277075288.14011.1019.camel@localhost>

Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> We really need to see the kernel messages reporting soft-lockup.

There aren't any.  Or, if there are, it isn't printing them to the
serial console.  Perhaps it is trying to send them only to syslogd
which obviously can't write to the disk either.

I know that the serial console is working after the crash as it echoes
my first carriage return (although then login or the shell wedges
before being able to print a prompt).  I could have asked for a magic
sysrq process dump but given that the process table is probably full I
expect this would take many hours at 9600bps.
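
As a rough illustration of that estimate, here is a minimal sketch of
the arithmetic.  It assumes 8N1 framing (10 bits on the wire per byte,
so about 960 bytes/s at 9600bps); the task counts and the bytes of
output per task are hypothetical figures, not measurements from this
machine.

```python
def serial_dump_seconds(num_tasks, bytes_per_task, baud=9600, bits_per_byte=10):
    """Estimate the seconds needed to print a task dump over a serial line.

    bits_per_byte=10 assumes 8N1 framing: 1 start bit, 8 data bits, 1 stop bit.
    """
    bytes_per_second = baud / bits_per_byte
    return num_tasks * bytes_per_task / bytes_per_second


if __name__ == "__main__":
    # Hypothetical inputs: neither the task counts nor the per-task
    # output size were measured on the affected system.
    for tasks in (1_000, 10_000, 30_000):
        seconds = serial_dump_seconds(tasks, bytes_per_task=1_500)
        print(f"{tasks:>6} tasks: about {seconds / 3600:.1f} hours")
```

The point is only that the time scales linearly with the number of
tasks, so a nearly full process table makes a dump at this line speed
impractical.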

> Note that your disk configuration is unusual and probably not
> well-tested by others.  It is unlikely that anyone in the kernel team
> will be able to debug this as I don't think any of us have particular
> expertise in this area.

Searching the web suggests that symptoms very similar to mine are not
uncommon, including instances without soft lockup messages, and none
of the other users seem to have a similar disk layout.

I can't easily test this theory but I think the unusual disk layout is
probably simply making a race easier to trigger.

>  If you have the opportunity, it would be
> helpful if you could test a newer kernel version.  This would give us a
> clue as to whether the bug has subsequently been fixed upstream, and if
> not then it would be the basis for an upstream bug report.

Unfortunately, testing this bug on the live system involves (a) risking
a crash and then, if the test fails, (b) an extremely vulnerable system
without backups for the following four days.

I'll see if I can borrow a spare R210 from Jump, in which case I may
be able to reproduce the problem in controlled conditions on my coffee
table at home (and with access to the VGA console).  Which kernel
should I test in that case?

Ian.
