
Bug#295657: Interesting article



On Tuesday 13 September 2005 21:15, Erik Forsberg wrote:
> Well, from a quick read it looks like they have problems when
> CONFIG_4KSTACKS are set. As far as I can see, and I may be wrong,
> CONFIG_4KSTACKS is not set in the Debian kernel packages we're having
> problems with.

OK, I am now quite convinced I can point to the layer that causes the
trouble: the device mapper, which encrypts all the data. Recall my
layout of encrypted ext3 on RAID; the layers the data travels through are:

vfs -> ext3 -> device mapper (crypto) -> raid (md) -> physical disks

Last week I reformatted the filesystems to reiserfs (v3.6) and got errors
very similar to those I saw with ext3:

Oct  2 15:08:05 einstein kernel: ReiserFS: warning: is_tree_node: node level 61124 does not match to the expected one 1
Oct  2 15:08:05 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4556481. Fsck?
Oct  2 15:08:05 einstein kernel: ReiserFS: warning: is_tree_node: node level 44943 does not match to the expected one 1
Oct  2 15:08:05 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4568631. Fsck?

When I redid my copying, the same errors were triggered:

Oct  2 15:25:36 einstein kernel: ReiserFS: warning: is_tree_node: node level 61124 does not match to the expected one 1
Oct  2 15:25:36 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4556481. Fsck?
Oct  2 15:25:57 einstein kernel: ReiserFS: warning: is_tree_node: node level 44943 does not match to the expected one 1
Oct  2 15:25:57 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4568631. Fsck?
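
To check that both runs really complain about the same blocks, the log lines can be grouped by device and block number. The following is only a minimal sketch: the regular expression is an assumption derived from the four vs-5150 lines quoted above, and the names BLOCK_RE, recurring_blocks and SAMPLE are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern for the vs-5150 warnings; assumed from the sample lines only,
# other kernels or syslog formats may need a different expression.
BLOCK_RE = re.compile(
    r"^(?P<stamp>\w+\s+\d+\s+[\d:]+)\s+\S+\s+kernel: ReiserFS: (?P<dev>\S+): "
    r"warning: vs-5150: search_by_key: invalid format found in block (?P<block>\d+)"
)

def recurring_blocks(lines):
    """Return {(device, block): [timestamps]} for blocks reported more than once."""
    seen = defaultdict(list)
    for line in lines:
        m = BLOCK_RE.match(line)
        if m:
            seen[(m.group("dev"), int(m.group("block")))].append(m.group("stamp"))
    # Keep only blocks that show up in more than one warning.
    return {key: stamps for key, stamps in seen.items() if len(stamps) > 1}

# The vs-5150 lines from the two copy runs quoted above.
SAMPLE = [
    "Oct  2 15:08:05 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4556481. Fsck?",
    "Oct  2 15:08:05 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4568631. Fsck?",
    "Oct  2 15:25:36 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4556481. Fsck?",
    "Oct  2 15:25:57 einstein kernel: ReiserFS: dm-1: warning: vs-5150: search_by_key: invalid format found in block 4568631. Fsck?",
]

if __name__ == "__main__":
    for (dev, block), stamps in sorted(recurring_blocks(SAMPLE).items()):
        print(dev, block, stamps)
```

On the excerpts above this reports blocks 4556481 and 4568631 on dm-1, each seen once per run.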

But... after I did some other things and retried again, there were
no troubles whatsoever. No fsck needed, the disk was read and files
were copied just fine. To me it seemed as if the disk pages were
kicked out of some internal cache, the system had to reread the disk
and the crypto handled the situation correctly that time.
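
One way to test the "bad read that later goes away" theory would be to checksum the same file several times and compare the results. This is a rough sketch under stated assumptions, not something that was run on the affected machine: file_digest, compare_passes and the optional between callback are hypothetical names, and the callback is only a placeholder for whatever is used to push the file out of the page cache between passes (for example reading a large amount of unrelated data). Without that step the later passes are likely served from memory and prove nothing about the disk path.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_passes(path, passes=3, between=None):
    """Hash the file `passes` times; return (digests, all_equal).

    `between` is an optional callable run before every pass except the
    first. It is a hypothetical hook for evicting the file from cache.
    """
    digests = []
    for i in range(passes):
        if i and between is not None:
            between()
        digests.append(file_digest(path))
    return digests, len(set(digests)) == 1
```

If the digests differ between passes while nothing writes to the file, the data changed somewhere between the disk and the reader, which would be consistent with a fault below the filesystem.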

Another thing that makes me suspect the crypto layer is the fact that
the system sometimes panics spontaneously. (I've logged the details of
those panics if anybody is interested.) I suspect this is triggered by
my having an encrypted swap partition. If I am correct, a page fault
would read in a page from swap, the crypto layer would mess up and
return a page of garbage.

Could this be a race in the "read" operation of dm?

For now I've completely removed the crypto from the disks and will run
plain reiserfs on RAID. No problems so far, and the system seems
stable, even during heavy disk I/O.

This closes the bug for me (for now), but all in all I feel strongly that
this is a bug that needs further exploring. Maybe it's an easy-to-find
issue, or maybe it's a specific race condition triggered by my dual P2
setup. Whatever it is, it bit me severely about three times a week!

If there's anybody looking into this, let me know how I can help.

Cheers, Jeroen


