
Bug#638231: linux-image-2.6.32-5-686-bigmem: instability when using ext4



On Wed, 2011-08-17 at 17:38 -0400, Micah Anderson wrote:
> Package: linux-image-2.6.32-5-686-bigmem
> Version: 2.6.32-35
> Severity: important
> Tags: squeeze
> 
> I have two machines that I upgraded to squeeze and migrated their ext3
> filesystems to ext4 due to very high i/o and deep directory hierarchy. These two
> machines have been crashing regularly since the ext4 upgrade. The other machines
> that I have that are running the squeeze kernel and ext3 are not crashing at
> all. 
> 
> When I upgraded the two crashy machines to the backports kernel, the crashes
> stopped. The crashes were happening at least 2x a week, sometimes much more
> frequently. Since the upgrade to the BPO kernel, the machines haven't crashed
> once in two months.
> 
> Both machines were showing console logs when they crashed that were similar:
> either they had nothing on them at all, or they had the following (in some cases
> magic-sysrq worked, sometimes it didn't). 
> 
> It seems pretty clear to me that there are some instability issues with ext4
> in the squeeze kernel. After discussion with Ted Tso on the subject, he indicated
> that there were a number of ext4 fixes that have been done that have not been
> backported to the squeeze kernel.
> 
> What follows are a few of the different things we saw on the console when the
> machine hung:

None of these logs show crashes.

> 1. 
> hoopoe login: [51589.926858] Uniform Multi-Platform E-IDE driver
> [51589.943819] ide-cd driver 5.00
> [51589.982978] ide-gd driver 1.18

I hope you're not actually using the ide-cd driver.

> [51590.039277] st: Version 20081215, fixed bufsize 32768, s/g segs 256
> [51590.262980] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [51590.269224] EDD information not available.
> [137993.853645] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [137993.860140] EDD information not available.
> [138361.949699] INFO: task rdiff-backup:28337 blocked for more than 120 seconds.
> [138361.957345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [138361.965791] rdiff-backup  D f6967700     0 28337  28335 0x00000000
> [138361.972772]  e8df2640 00200086 c5808e20 f6967700 f696772c c143de20 c143de20 c1439354
> [138361.999573]  e8df27fc c5808e20 00000000 c143de20 ea15f800 c5808e20 ea15f800 c127eb36
> [138362.008374]  c5804354 e8df27fc 020e12de c143de20 c143de20 00000000 00000000 00000000
> [138362.034480] Call Trace:
> [138362.037276]  [<c127eb36>] ? schedule+0x78f/0x7dc
> [138362.042407]  [<c127f28f>] ? __mutex_lock_common+0xe8/0x13b
> [138362.048233]  [<c127f2f1>] ? __mutex_lock_slowpath+0xf/0x11
> [138362.054239]  [<c127f382>] ? mutex_lock+0x17/0x24
> [138362.059403]  [<c127f382>] ? mutex_lock+0x17/0x24
> [138362.081061]  [<c10d40c5>] ? sync_filesystems+0xf/0xbb
> [138362.104668]  [<c10d41a3>] ? sys_sync+0xe/0x29
> [138362.109660]  [<c100813b>] ? sysenter_do_call+0x12/0x28

Probably the disk is being thrashed, so sync takes a very long time.  As
I understand it, Linux 2.6.32 had some major changes to writeback
(delayed writes to disk) which improved behaviour in some situations
but caused regressions in others.  Unfortunately there isn't a simple
fix that can be cherry-picked.

It could also be a locking bug, but I doubt it.  If you're able to check
whether there is ongoing disk I/O while the task is blocked, that would
confirm which is the case.
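
One rough way to check for ongoing disk I/O is to take two snapshots of
/proc/diskstats a few seconds apart and compare the sector counters.
The sketch below assumes the usual field layout (device name in the
third column, sectors read in the sixth, sectors written in the tenth,
I/Os in flight in the twelfth). The function names and the 5-second
interval are illustrative only.

```python
import time


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written, ios_in_flight)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        stats[fields[2]] = (int(fields[5]), int(fields[9]), int(fields[11]))
    return stats


def io_activity(before, after):
    """Sectors read and written per device between two snapshots."""
    activity = {}
    for name, (read1, written1, in_flight) in after.items():
        if name not in before:
            continue  # device appeared between snapshots
        read0, written0, _ = before[name]
        activity[name] = {
            "read": read1 - read0,
            "written": written1 - written0,
            "in_flight": in_flight,
        }
    return activity


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return the difference."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return io_activity(before, after)
```

Calling sample() while a task is reported as blocked gives one dict per
device. Steadily growing "written" counts would point towards heavy
writeback. Counters that stay flat while I/Os remain in flight would be
more consistent with a stall or a locking problem.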

> 2. 
> hoopoe login: [17163.173748] Uniform Multi-Platform E-IDE driver
> [17163.187023] ide-cd driver 5.00
> [17163.216390] ide-gd driver 1.18
> [17163.269327] st: Version 20081215, fixed bufsize 32768, s/g segs 256
> [17163.425212] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
> [17163.431412] EDD information not available.
> [32426.998664] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> [32427.005956] ata4.00: failed command: WRITE FPDMA QUEUED
> [32427.011225] ata4.00: cmd 61/08:00:57:a2:33/00:00:3a:00:00/40 tag 0 ncq 4096 out
> [32427.011227]          res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [32427.026096] ata4.00: status: { DRDY }
> [32427.029818] ata4: hard resetting link
> [32427.509533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [32427.553582] ata4.00: configured for UDMA/133
> [32427.557906] ata4: EH complete

This is likely a hardware (or drive firmware) problem, though it could
possibly be a bug in libata.

> 3.
[...]

Same.

> 4.
[...]

Probably disk thrashing again.

If the problem in cases 1 and 4 really is disk thrashing, it may be
worth trying to tune writeback via sysctl vm.dirty_ratio, as explained
in https://lwn.net/Articles/399148/
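
Before changing anything, it can help to see roughly how much dirty data
the current settings allow. The sketch below reads vm.dirty_ratio and
vm.dirty_background_ratio from /proc/sys/vm and turns them into
approximate sizes. It assumes the percentages apply to MemTotal, which
is a simplification: the kernel applies them to the memory it considers
dirtyable, which is smaller. The helper names are illustrative.

```python
def read_int(path):
    """Read a single integer from a /proc or /sys file."""
    with open(path) as f:
        return int(f.read().split()[0])


def parse_meminfo_kb(text, key):
    """Return the value in kB for `key` from /proc/meminfo-style text."""
    for line in text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])
    raise KeyError(key)


def dirty_thresholds_kb(mem_kb, dirty_ratio, dirty_background_ratio):
    """Approximate (background, hard) dirty limits in kB.

    Assumption: mem_kb stands in for the kernel's dirtyable memory,
    so the real limits will be somewhat lower.
    """
    background = mem_kb * dirty_background_ratio // 100
    hard = mem_kb * dirty_ratio // 100
    return background, hard


def current_thresholds_kb():
    """Read the live settings and return the approximate limits in kB."""
    with open("/proc/meminfo") as f:
        mem_kb = parse_meminfo_kb(f.read(), "MemTotal")
    return dirty_thresholds_kb(
        mem_kb,
        read_int("/proc/sys/vm/dirty_ratio"),
        read_int("/proc/sys/vm/dirty_background_ratio"),
    )
```

The first number is roughly where background writeback starts, and the
second is where writing processes are made to wait. Lowering both
ratios makes the kernel start flushing earlier, so less data is queued
up by the time something calls sync.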

Cases 2 and 3 are clearly different; you should open a separate bug
report if you think they are not hardware/firmware issues.
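
To keep the two kinds of event apart when going through saved console
logs, a small filter along these lines may help. It is only a sketch:
the patterns are written against the "blocked for more than" and
"failed command" lines quoted above, and the event labels are made up
for illustration.

```python
import re

# Hung-task warning, as in case 1.
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)
# libata failed-command report, as in case 2.
ATA_FAILED = re.compile(r"(ata\d+\.\d+): failed command: (.+)")


def classify(log_text):
    """Return a list of (kind, subject, detail) tuples found in the log."""
    events = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            events.append(("hung_task", match.group(1), int(match.group(3))))
            continue
        match = ATA_FAILED.search(line)
        if match:
            events.append(("ata_error", match.group(1), match.group(2).strip()))
    return events
```

A log that only yields "hung_task" events belongs with cases 1 and 4.
Any "ata_error" events belong with cases 2 and 3.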

Ben.
