Bug#292290: kernel-image-2.6.8-1-k7: XFS filesystem corruption: Input/output error
Hi,
This is a followup for Debian bug <http://bugs.debian.org/292290>.
Joost van Baal <joostvb-debian-bugs-20050126-3@mdcc.cx> - Wed, Jan 26, 2005:
> `./lib/modules/2.6.10-1-k7/kernel/drivers/atm/zatm.ko': Unknown error 990
> I've heard of one other victim of this problem with this kernel.
Wessel Dankers <wsl-debbugs@fruit.eu.org> - Thu, Jan 27, 2005:
> I myself have been a victim of this too, so I thought I'd join in.
Well, me too.
> - the kernel was Debian's 2.6.8;
> - the filesystem in question was XFS;
> - software raid1 (mirroring) was used.
> XFS complained about corrupted in-memory structures in some of the cases.
> However, it is very unlikely that all three machines have bad RAM, and
> memtest86+ reports no problems.
I am also using Debian's kernel-image-2.6.8-2-686, version 2.6.8-13.
First of all, I'm using a PIV, so this isn't K7-specific. I am NOT
using RAID 1 or LVM, just plain XFS.
The first corruption appeared in my "mail/debian-project/" folder,
more precisely in the "tmp/" subdirectory. The second appeared today, on
./usr/share/doc/texmf/help/Catalogue/entries/romannum.html:
dpkg: error processing /var/cache/apt/archives/tetex-doc_2.0.2c-6_all.deb (--unpack):
unable to stat `./usr/share/doc/texmf/help/Catalogue/entries/romannum.html' (which I was about to install): Unknown error 990
This seems to be a really serious XFS problem.
To understand the problem, I tried stracing:
bee% LC_ALL=C strace -f ls debian-project-fucked/tmp 2>&1
...
rt_sigprocmask(SIG_UNBLOCK, [RTMIN], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
brk(0) = 0x805b000
brk(0x807c000) = 0x807c000
brk(0) = 0x807c000
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=24, ws_col=80, ws_xpixel=644, ws_ypixel=388}) = 0
stat64("debian-project-fucked/tmp", {st_mode=0, st_size=0, ...}) = -990
write(2, "ls: ", 4ls: ) = 4
write(2, "debian-project-fucked/tmp", 25debian-project-fucked/tmp) = 25
write(2, ": Unknown error 990", 19: Unknown error 990) = 19
write(2, "\n", 1
) = 1
The problem seems to occur in the stat64() syscall, but I couldn't
find out what error 990 is supposed to be from the headers in
/usr/include, so I moved on to the kernel source and looked at the
various syscall implementations. I also tried to work out which
syscalls could trigger the problem:
I checked with:
bee% LC_ALL=C strace zsh -e -c "cd debian-project-fucked/tmp; ls"
and got the error from chdir() too, so I looked at sys_chdir().
Then I checked whether this was directory specific, and tried:
bee% LC_ALL=C strace -f ls -i \
/usr/share/doc/texmf/help/Catalogue/entries/ 2>&1
I got lstat64() errors on a bunch of files.
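For anyone who wants to list the affected entries without reading
through strace output, here is a minimal sketch of the same check. It
assumes the corrupted entries fail lstat() with errno 990, as in my
traces; the names scan_directory, describe and the EFSCORRUPTED
constant are mine, since 990 is not known to userland. It only covers
the case where the directory itself can still be listed.

```python
import errno
import os

# Assumption: 990 is the raw errno seen in the strace output above.
# Userland has no name for it, so we define one ourselves.
EFSCORRUPTED = 990


def scan_directory(path, lstat=os.lstat, listdir=os.listdir):
    """Return (name, errno) pairs for entries of path whose lstat() fails."""
    failures = []
    for name in sorted(listdir(path)):
        full = os.path.join(path, name)
        try:
            lstat(full)
        except OSError as exc:
            # Keep the raw errno so unknown values like 990 are not lost.
            failures.append((name, exc.errno))
    return failures


def describe(err):
    """Give a readable label for an errno value, including 990."""
    if err == EFSCORRUPTED:
        return "EFSCORRUPTED (990)"
    return "%s (%d)" % (errno.errorcode.get(err, "unknown"), err)
```

Calling scan_directory("/usr/share/doc/texmf/help/Catalogue/entries")
and passing each errno to describe() should give one line per broken
file. The lstat and listdir parameters are only there so the function
can be tried on a healthy filesystem with stand-ins.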
Then I looked upstream, first at bugme.osdl.org, and found:
http://bugme.osdl.org/show_bug.cgi?id=3224 (still open)
Finally, I looked at SGI's bugzilla, and found a first cluster of bugs:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=197
The problem also seems to appear in a comment of:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=383
Bug 197 is really worth reading; using MD / LVM devices seems to help
trigger the bug.
These are dups of the above:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=204
http://oss.sgi.com/bugzilla/show_bug.cgi?id=207
The final patch attached to the bug report is:
http://oss.sgi.com/bugzilla/attachment.cgi?id=59&action=view
I couldn't find an applied version of it in the kernel: the code looked
too different, although xfs_finish_reclaim_all() was there...
2.6.8 was released in August 2004, and the patch mentioned dates from
January 2003, so I can only think we are facing a different bug.
Then I went thoroughly through the bugzilla and found another bug which
might be related:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=338 (on a 2.4 kernel)
When I found out error 990 means EFSCORRUPTED, I thought I wouldn't be
able to track down the problem any further...
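A rough sketch of the header search, for anyone who wants to repeat it
on another error number. It walks a source tree and reports every
"#define NAME 990" style line. The function name find_defines and the
root directory are placeholders of mine; point it at whichever kernel
tree you have unpacked.

```python
import os
import re

# Matches lines such as "#define SOME_NAME 990 /* comment */".
DEFINE_RE = re.compile(r"^\s*#\s*define\s+(\w+)\s+(\d+)\b")


def find_defines(root, value, suffixes=(".h",)):
    """Return (path, lineno, name) for each '#define NAME <value>' under root."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in sorted(files):
            if not fname.endswith(suffixes):
                continue
            path = os.path.join(dirpath, fname)
            # errors="replace" avoids failing on odd encodings in old headers.
            with open(path, errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    match = DEFINE_RE.match(line)
                    if match and int(match.group(2)) == value:
                        hits.append((path, lineno, match.group(1)))
    return hits
```

Running find_defines on /usr/include with 990 should come back empty,
which matches what I saw; running it on the kernel source is how the
name turns up. It only catches plain decimal defines, not values built
from expressions.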
So I'm about to get a fresh xfsprogs or a live CD and xfs_repair my FS
to get a log and send it upstream.
Regards,
--
Loïc Minier <lool@dooz.org>
"Neutral President: I have no strong feelings one way or the other."