Bug#584881: Deadlock in md barrier code? / RAID1 / LVM CoW snapshot + ext3 / Debian 5.0 - lenny 2.6.26 kernel
I recently posted the below message to linux-raid, but perhaps it should
have gone here first... Perhaps Neil Brown will have some bright ideas.
Common factors on all three pieces of hardware seeing the problem seem
to have been:
Lenny 2.6.26 kernel
md with lvm snapshots
4 or more cores
I have a box with a relatively simple setup:
sda + sdb are 1TB SATA drives attached to an Intel ICH10.
Three partitions on each drive, three md RAID1s built on top of these;
md2 is the LVM PV.
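For concreteness, the stack described above can be sketched with mdadm and LVM commands - a hypothetical reconstruction, since the actual partition numbers and VG name aren't in this report:

```shell
# Hypothetical sketch of the layout described above; partition
# numbers and the VG name are assumptions, not from the real box.
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
pvcreate /dev/md2                # md2 acts as the LVM physical volume
vgcreate vg0 /dev/md2            # VG name assumed
```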
During a resync about a week ago, processes seemed to deadlock on I/O;
the machine was still alive but with a load of 100+. A USB drive
happened to be mounted, so I managed to save /var/log/kern.log. At the
time of the problem, the monthly RAID check was in progress. On reboot,
a rebuild commenced, and the same deadlock seemed to occur between
roughly 2 and 15 minutes after boot.
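For reference, the monthly RAID check that Debian's mdadm package schedules via checkarray boils down to roughly this (simplified sketch):

```shell
# Kick off a consistency check on md2 (roughly what checkarray does)
echo check > /sys/block/md2/md/sync_action
cat /proc/mdstat                 # shows check/resync progress
```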
At this point, the server was running on a Dell PE R300 (12G RAM,
quad-core), with an LSI SAS controller and 2x 500G SATA drives. I
shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB
drive, 8G RAM, quad-core+HT), with only a single drive, so I created the
md RAID1s with just a single drive in each. The original box was taken
offline with the idea that I would debug it "soon".
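A degraded RAID1 with a single member can be created by naming the absent device "missing"; this is roughly what I mean above (device names here are assumptions):

```shell
# Create a one-drive RAID1, leaving the second slot empty
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 missing
# Later, adding a second drive triggers the resync
mdadm --add /dev/md2 /dev/sdb3
```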
This morning, I added in a second 1TB drive, and during the resync
(approx 1 hour in), the deadlock occurred again. The resync had
stopped, and any attempt to write to md2 would deadlock the process in
question. I think it was doing an rsnapshot backup to a USB drive at the
time the initial problem occurred - this creates an LVM snapshot device
on top of md2 for the duration of the backup, one per filesystem backed
up (there are two at the moment), and I suppose this results in lots of
copy-on-write operations - the mounting of the snapshots shows up in
the logs as the fs mounts and subsequent orphan_cleanups. As the
snapshot survives the reboot, I assume this is what triggers the
subsequent lockup after the machine has rebooted.
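Roughly what rsnapshot's LVM support does per backed-up filesystem (the LV names and snapshot size here are assumptions):

```shell
# Snapshot, mount read-only, back up, then tear down
lvcreate --snapshot --size 1G --name rsnap /dev/vg0/home
mount -o ro /dev/vg0/rsnap /mnt/rsnap
# ... rsync the data off to the USB drive ...
umount /mnt/rsnap
lvremove -f /dev/vg0/rsnap
```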
I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this
time... Edited copies of kern.log are attached - it looks like it's
barrier-related. I'd guess the combination of the LVM CoW snapshot and
the RAID resync is tickling this bug.
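For anyone wanting to capture the same traces: the sysrq 'w' trigger dumps the stacks of all blocked (uninterruptible) tasks to the kernel log:

```shell
# Dump blocked-task stacks; needs root and sysrq enabled
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200              # or check /var/log/kern.log
```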
Any thoughts? Maybe this is related to Debian bug #584881 -
... since the kernel is essentially the same.
I can do some debugging on this out-of-office-hours, or can probably
resurrect the original hardware to debug that too.
Logs are here:
I think vger binned the first version of this email (with the logs
attached) - so apologies if you've ended up with two copies of this email...
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309