
Severe EXT3 bug(s) in vanilla kernel 2.6.18?



Hi all,

After browsing through the debian-kernel mailing list archive, I found
that no one has reported the latest EXT3 problems in the vanilla
kernel. The last report of EXT3 problems on the debian-kernel list had
to do with JBD; the current problems (as posted on the Linux Kernel
mailing list) are much worse, I think.
You might want to check these URLs/subjects of discussion on LKML:

  "2.6.18-mm2: ext3 BUG?"
  http://lkml.org/lkml/2006/10/5/353
Seems unresolved


  "2.6.19 file content corruption on ext3"
  http://lkml.org/lkml/2006/12/7/163
Has to do with 2.6.19, but might have its roots in 2.6.18


  "Debugging I/O errors?"
  http://lkml.org/lkml/2006/10/20/93
Source unknown, but more people seem to have the same problem.


These issues got my attention, because I'm having those (or similar)
problems myself, on two different machines (clusters, actually) with
completely different hardware and disks. I'll explain.

I'm maintaining two clusters, with machines running Debian Stable
combined with Etch kernels to get AoE (ATA over Ethernet) support.
Machines in these clusters "export" their hard disks using AoE (check out
the "vblade" package), and one machine imports those using the kernel
"aoe" module. On top of those imported devices, multiple RAID5 arrays
are created; LVM runs on top of the RAID arrays, and ext3 on the LVM LVs.

After a few days, I get EXT3 errors like this:
> EXT3-fs: mounted filesystem with ordered data mode.
> EXT3-fs error (device loop0): ext3_free_blocks_sb: bit already cleared for block 412186
> Aborting journal on device loop0.
> EXT3-fs error (device loop0) in ext3_free_blocks_sb: Journal has aborted
> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device loop0) in ext3_truncate: Journal has aborted
> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device loop0) in ext3_orphan_del: Journal has aborted
> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device loop0) in ext3_delete_inode: Journal has aborted
> __journal_remove_journal_head: freeing b_committed_data
> __journal_remove_journal_head: freeing b_committed_data
(...)
> __journal_remove_journal_head: freeing b_committed_data
> ext3_abort called.
> EXT3-fs error (device loop0): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> __journal_remove_journal_head: freeing b_committed_data

Running fsck on the filesystem fixes those errors, but after a few days
(or weeks, depending on the fs load) the corruption appears again. It
might be worth telling you that there are no other suspicious messages
in my logs.
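
Since the corruption only shows up after days or weeks, it can help to
pick out the very first EXT3 error per device from the kernel log, as
the later "Journal has aborted" lines are only consequences of it. The
following is a minimal sketch, not an existing tool: the function name
summarize_ext3_errors and the regular expression are my own assumptions,
modelled only on the two message shapes seen in the log above.

```python
import re

# Assumed pattern, derived from the two message shapes in the log above:
#   EXT3-fs error (device loop0): ext3_free_blocks_sb: bit already cleared ...
#   EXT3-fs error (device loop0) in ext3_truncate: Journal has aborted
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<device>[^)]+)\)(?::| in) "
    r"(?P<function>\w+): (?P<message>.*)"
)


def summarize_ext3_errors(lines):
    """Return, per device, the first EXT3 error seen and the total count."""
    summary = {}
    for line in lines:
        match = EXT3_ERROR.search(line)
        if match is None:
            continue  # not an EXT3-fs error line
        entry = summary.setdefault(
            match["device"],
            {
                "first_function": match["function"],
                "first_message": match["message"].strip(),
                "count": 0,
            },
        )
        entry["count"] += 1
    return summary


if __name__ == "__main__":
    # Sample lines copied from the log excerpt above.
    sample = [
        "EXT3-fs: mounted filesystem with ordered data mode.",
        "EXT3-fs error (device loop0): ext3_free_blocks_sb: "
        "bit already cleared for block 412186",
        "EXT3-fs error (device loop0) in ext3_truncate: Journal has aborted",
    ]
    for device, info in summarize_ext3_errors(sample).items():
        print(device, info["first_function"], info["first_message"], info["count"])
```

Feeding it the lines of a saved kernel log (for example the output of
dmesg) would give one entry per affected device, which makes it easier
to compare the initial error between the two clusters.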

This seems to be related to the problem described here:
  http://myrddin.org/2006/02/14/ext3-nastiness/

and here:
  http://www.debian-administration.org/users/Utumno/weblog/16


I don't know whether I need to file a bug on this; for now I just want
to hear your thoughts. FYI:

Kernel information for cluster 1:
> root@infinity:~# uname -a
> Linux infinity 2.6.17-2-686 #1 SMP Wed Sep 13 16:34:10 UTC 2006 i686 GNU/Linux

And cluster 2:
> dust:~# uname -a
> Linux dust 2.6.18-3-686 #1 SMP Thu Nov 23 20:49:23 UTC 2006 i686 GNU/Linux

Thanks for your replies!

Best regards,

  -- Bas van Schaik



