
crash and data loss on supermicro xdwn7+/xeon5420/adaptec 52445/lvm/ext4



I reported this as a bug to bugzilla.kernel.org as #16081, but since
we're running Debian I thought asking around for help and discussion
would be advisable.

The current situation is:
We have a Supermicro XDWN7+ board (Intel 5400, Xeon 5420 CPU, 8 GB RAM)
with 24 1 TB SATA disks attached to an Adaptec 52445 controller and a
Tandberg tape library attached to an LSI SAS1068E SAS controller. The
system runs Lenny with a recompiled (no options changed) Linux
2.6.32-12 kernel.

We use bacula 5.0.2 (backported to Lenny) as our backup software, and so
far it works quite well. The only problem is that after writing around
10 TiB of data to the disks, the machine crashes. This has happened
twice, and after the second crash both filesystems containing the backup
disk pool (9 TiB LVM volumes with ext4 filesystems) were completely
garbled. One fs now looks like this:

shepherd:~# ls -la /mnt/lost+found/  | head -n 20
total 216936
drwx------ 250 root       root          69632 2010-05-31 13:10 .
drwxr-xr-x   3 root       root           4096 2010-05-31 13:10 ..
c----wxr--   1  774037444  162299347 237, 210 1957-02-23 13:50 #1000
brwx-----T   1 1954511736 3121970260 249, 121 1922-08-12 15:08 #10021
b-w---xrwt   1  543753214 3130053982 234, 213 2012-06-01 07:58 #10027
c--S--sr-T   1 3871079531 3443641576   2, 232 2036-01-31 13:12 #10036
-r-S-w-r-T   1 2298731406  344458386    32768 2035-05-22 08:46 #10046
brw---Srw-   1 2052225653 4012639896 218, 196 1912-06-23 18:14 #10067
prwS-wSr-x   1 2235883341 1302567651        0 1927-10-10 00:51 #10086
s-wS--x-wt   1 2286828425 2999490124        0 1949-08-22 22:50 #10109
crw--wSrwt   1 3083778288 3882824206 148, 212 2003-07-28 08:32 #10126
s-wS--sr-x   1  874900871   80451928        0 1977-11-28 01:52 #10130
s--sr-x---   1 1903432768    1059722        0 2013-07-05 00:55 #10131
c-w-r-Sr-T   1 3259732952 2590389953   9,  22 2012-06-19 14:56 #10147
pr-x-w--wt   1 1627318825 1016384218        0 1956-12-27 06:01 #10160
srw-r-SrwT   1 2603486838 3240878817        0 1954-11-16 08:43 #10177
srw---srwt   1  458009213  951782573        0 2023-12-03 18:43 #10184
brwxr--rwx   1 2423698452 2252742920  44, 231 1956-07-25 07:28 #10197
brwS-wS-w-   1 3480615060 1244965598  44, 189 2006-10-21 17:03 #1020
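
To get a rough idea of how much of lost+found consists of bogus device
nodes, sockets and FIFOs rather than recoverable regular files, the
entries can be tallied by file type. The following is only a sketch: the
function name tally_types is made up for illustration, and it assumes
GNU find (for -printf), which is what Lenny ships.

```shell
#!/bin/sh
# Hypothetical helper: count the entries directly inside a directory,
# grouped by file type as reported by GNU find's %y:
#   f = regular file, d = directory, b = block device, c = char device,
#   p = FIFO, s = socket, l = symlink
tally_types() {
    # -mindepth 1 skips the directory itself, -maxdepth 1 avoids recursion
    find "$1" -mindepth 1 -maxdepth 1 -printf '%y\n' | sort | uniq -c
}
```

It would be called as `tally_types /mnt/lost+found`. It only reads
directory entries and inode metadata, so it does not write to the
damaged filesystem as long as that is mounted read-only.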

The other one is not mountable anymore:
[88397.252831] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[88397.252856] EXT4-fs (dm-1): group descriptors corrupted!

One thing to note: with Supermicro's current BIOS 1.2b for this board,
the machine crashes with an MCE after a fair amount of network and disk
I/O (around 2-5 TiB, I believe). This does not happen with their BIOS
version 1.1b, which is installed at the moment.
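
To check whether the kernel logged any machine check messages before a
crash, the kernel log can be filtered for them. This is a minimal
sketch, not a full MCE decoder: the function name mce_lines is
hypothetical, and it simply assumes that the relevant messages contain
the words "machine check" in some capitalisation.

```shell
#!/bin/sh
# Hypothetical helper: print every line on stdin that mentions a
# machine check (case-insensitive). Reading from stdin lets it work on
# live dmesg output as well as on a saved log file.
mce_lines() {
    # grep exits non-zero when nothing matches; treat that as "no lines"
    grep -i 'machine check' || true
}
```

Typical use would be `dmesg | mce_lines`, or `mce_lines < /var/log/kern.log`
for messages from before a reboot (the log path is an assumption and
depends on the syslog configuration).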

I'm at a loss here, as I really don't know what's causing these crashes
and also don't really know how I can debug this any further. Does
anybody have any hints for me?

memtest86 runs fine for hours, by the way, and the machine doesn't have
heat problems (at least the IPMI-console doesn't say so, and the fans
are all fine).

More info on the system can be found at
https://bugzilla.kernel.org/show_bug.cgi?id=16081 (lspci/lsscsi)

Thanks,
Lukas



