On Mon, 2010-06-21 at 11:11 +0100, Ian Jackson wrote:
> Ben Hutchings writes ("Re: Bug#584881: Lockups under heavy disk IO; md (RAID) resync/check implicated"):
> > We really need to see the kernel messages reporting soft-lockup.
>
> There aren't any. Or, if there are, it isn't printing them to the
> serial console. Perhaps it is trying to send them only to syslogd
> which obviously can't write to the disk either.
>
> I know that the serial console is working after the crash as it echoes
> my first carriage return (although then login or the shell wedges
> before being able to print a prompt). I could have asked for a magic
> sysrq process dump but given that the process table is probably full I
> expect this would take many hours at 9600bps.

Even if you can't get a process dump, you can get some useful
information with:

'd' - show locks held
'l' - show backtrace for active CPUs
'w' - show uninterruptible tasks

> > Note that your disk configuration is unusual and probably not
> > well-tested by others. It is unlikely that anyone in the kernel team
> > will be able to debug this as I don't think any of us have particular
> > expertise in this area.
>
> Search the web suggests that symptoms very similar to mine are not
> uncommon, including instances without soft lockup messages, and none
> of the other users seem to have a similar disk layout.
>
> I can't easily test this theory but I think the unusual disk layout is
> probably simply making a race easier to trigger.

Thinking of some kind of lock-dependency bug? Blocking on a mutex for a
long period should still trigger a soft-lockup message. Since there are
no messages from the kernel it's something of a mystery what's going on.

> > If you have the opportunity, it would be
> > helpful if you could test a newer kernel version. This would give us a
> > clue as to whether the bug has subsequently been fixed upstream, and if
> > not then it would be the basis for an upstream bug report.
>
> Unfortunately testing this bug on the live system involves (a) risking
> a crash and then if the test fails (b) an extremely vulnerable system
> without backups for the following four days.

That's what I suspected.

> I'll see if I can borrow a spare R210 from Jump, in which case I may
> be able to reproduce the problem in controlled conditions on my coffee
> table at home (and with access to the VGA console). Which kernel
> should I test in that case ?

Please try 2.6.34 from experimental.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.