
Software RAID 5 SATA array crashed



Back in October I set up a software RAID 5 array using md. I used five 300 GB SATA-II drives, running on two Promise TX4 SATAII controllers (the new ones with NCQ). One controller was connected to two drives, the other to three.

A few days ago, after moving to a new house, I set up the server containing the array and tried to connect to it. I couldn't reach it over the intranet, so I hooked up a keyboard and monitor to see what was wrong. The kernel hadn't even finished its boot procedure: right as md was initialized (it is built in, not a module), there was a call trace and a kernel error along the lines of "IRQ 193: nobody cared!". Following that were repeating messages about SCSI commands failing on, I believe, three of my drives.
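For what it's worth, "irq N: nobody cared" usually means a (typically shared, level-triggered) interrupt line kept firing and no registered handler claimed it, which can happen when two controllers end up sharing a line badly. /proc/interrupts shows who shares which IRQ. The sketch below runs against a captured snippet rather than the live file, and the counts and device names in it are invented, not from my box:

```shell
# Hypothetical /proc/interrupts excerpt: both Promise controllers and
# the NIC all landed on IRQ 193, the line from the error message.
interrupts='
193:     482019   IO-APIC-level  sata_promise, sata_promise, eth0
'

# Count how many devices share IRQ 193; more than one handler on a
# level-triggered line is the classic setup for "irq 193: nobody cared".
# (Fields: IRQ, count, type, then one name per sharing driver.)
shared=$(echo "$interrupts" | awk '$1 == "193:" { print NF - 3 }')
echo "devices sharing IRQ 193: $shared"
```

On the real machine the equivalent would be `grep 193: /proc/interrupts`, plus `dmesg | grep -A2 "nobody cared"` for the trace.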

Rebooting the machine didn't make the behavior go away. I powered it off and reseated all of the SATA connectors. This time, when booting up, I made progress. Here is what syslog said upon autodetecting the MD array:

Sep 11 23:46:57 localhost kernel: md: Autodetecting RAID arrays.
Sep 11 23:46:57 localhost kernel: md: autorun ...
Sep 11 23:46:57 localhost kernel: md: considering sdf1 ...
Sep 11 23:46:57 localhost kernel: md:  adding sdf1 ...
Sep 11 23:46:57 localhost kernel: md:  adding sde1 ...
Sep 11 23:46:57 localhost kernel: md:  adding sdd1 ...
Sep 11 23:46:57 localhost kernel: md:  adding sdc1 ...
Sep 11 23:46:57 localhost kernel: md:  adding sdb1 ...
Sep 11 23:46:57 localhost kernel: md: created md0
Sep 11 23:46:57 localhost kernel: md: bind<sdb1>
Sep 11 23:46:57 localhost kernel: md: bind<sdc1>
Sep 11 23:46:57 localhost kernel: md: bind<sdd1>
Sep 11 23:46:57 localhost kernel: md: bind<sde1>
Sep 11 23:46:57 localhost kernel: md: bind<sdf1>
Sep 11 23:46:57 localhost kernel: md: running: <sdf1><sde1><sdd1><sdc1><sdb1>
Sep 11 23:46:57 localhost kernel: md: kicking non-fresh sdc1 from array!
Sep 11 23:46:57 localhost kernel: md: unbind<sdc1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdc1)
Sep 11 23:46:57 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 11 23:46:57 localhost kernel: raid5: device sdf1 operational as raid disk 4
Sep 11 23:46:57 localhost kernel: raid5: device sde1 operational as raid disk 3
Sep 11 23:46:57 localhost kernel: raid5: device sdd1 operational as raid disk 2
Sep 11 23:46:57 localhost kernel: raid5: device sdb1 operational as raid disk 0
Sep 11 23:46:57 localhost kernel: raid5: cannot start dirty degraded array for md0
Sep 11 23:46:57 localhost kernel: RAID5 conf printout:
Sep 11 23:46:57 localhost kernel:  --- rd:5 wd:4 fd:1
Sep 11 23:46:57 localhost kernel:  disk 0, o:1, dev:sdb1
Sep 11 23:46:57 localhost kernel:  disk 2, o:1, dev:sdd1
Sep 11 23:46:57 localhost kernel:  disk 3, o:1, dev:sde1
Sep 11 23:46:57 localhost kernel:  disk 4, o:1, dev:sdf1
Sep 11 23:46:57 localhost kernel: raid5: failed to run raid set md0
Sep 11 23:46:57 localhost kernel: md: pers->run() failed ...
Sep 11 23:46:57 localhost kernel: md: do_md_run() returned -22
Sep 11 23:46:57 localhost kernel: md: md0 stopped.
Sep 11 23:46:57 localhost kernel: md: unbind<sdf1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdf1)
Sep 11 23:46:57 localhost kernel: md: unbind<sde1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sde1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdd1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdd1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdb1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdb1)
Sep 11 23:46:57 localhost kernel: md: ... autorun DONE.

Note the message about sdc1 being non-fresh. Also note that the array is both DIRTY and DEGRADED: degraded (I'm guessing) because sdc1 was kicked as failed, and dirty because the machine was powered off while it was erroring, so the array was never flushed and shut down cleanly.
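As I understand it, md decides "freshness" by comparing the event counter stored in each member's superblock, so `mdadm --examine` on each partition would show how far sdc1 had fallen behind. A sketch of that comparison, run against a captured snippet rather than live output (the counter values here are invented for illustration):

```shell
# Hypothetical per-member "Events" lines, as grep would pull them out
# of "mdadm --examine /dev/sdX1" for each array member.
examine='
sdb1 Events : 0.104
sdc1 Events : 0.97
sdd1 Events : 0.104
sde1 Events : 0.104
sdf1 Events : 0.104
'

# Flag any member whose event count differs from the majority; that
# is the member md would kick as "non-fresh" at assembly time.
nonfresh=$(echo "$examine" | awk 'NF {
    sub(/^0\./, "", $4); ev[$1] = $4; cnt[$4]++
  }
  END {
    for (v in cnt) if (cnt[v] > best) { best = cnt[v]; maj = v }
    for (d in ev) if (ev[d] != maj)
      print d " is non-fresh: events " ev[d] " vs majority " maj
  }')
echo "$nonfresh"
```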

I played around with mdadm but I could never get the array to start. All of the superblocks were intact, including sdc1's. Finally, I ran "mdrun", which managed to start the array. Here is the logging associated with that command:

Sep 12 01:04:37 localhost kernel: md: md0 stopped.
Sep 12 01:04:37 localhost kernel: md: bind<sdc>
Sep 12 01:04:37 localhost kernel: md: bind<sdd>
Sep 12 01:04:37 localhost kernel: md: bind<sdf>
Sep 12 01:04:37 localhost kernel: md: bind<sde>
Sep 12 01:04:37 localhost kernel: md: bind<sdb>
Sep 12 01:04:37 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 12 01:04:37 localhost kernel: raid5: device sdb operational as raid disk 0
Sep 12 01:04:37 localhost kernel: raid5: device sde operational as raid disk 4
Sep 12 01:04:37 localhost kernel: raid5: device sdf operational as raid disk 3
Sep 12 01:04:37 localhost kernel: raid5: device sdd operational as raid disk 2
Sep 12 01:04:37 localhost kernel: raid5: device sdc operational as raid disk 1
Sep 12 01:04:37 localhost kernel: raid5: allocated 5248kB for md0
Sep 12 01:04:37 localhost kernel: raid5: raid level 5 set md0 active with 5 out of 5 devices, algorithm 2
Sep 12 01:04:37 localhost kernel: RAID5 conf printout:
Sep 12 01:04:37 localhost kernel:  --- rd:5 wd:5 fd:0
Sep 12 01:04:37 localhost kernel:  disk 0, o:1, dev:sdb
Sep 12 01:04:37 localhost kernel:  disk 1, o:1, dev:sdc
Sep 12 01:04:37 localhost kernel:  disk 2, o:1, dev:sdd
Sep 12 01:04:37 localhost kernel:  disk 3, o:1, dev:sdf
Sep 12 01:04:37 localhost kernel:  disk 4, o:1, dev:sde
Sep 12 01:04:37 localhost kernel: md: syncing RAID array md0
Sep 12 01:04:37 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Sep 12 01:04:37 localhost kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Sep 12 01:04:37 localhost kernel: md: using 128k window, over a total of 293057280 blocks.
Sep 12 01:04:37 localhost kernel: md: md1 stopped.
Sep 12 01:04:37 localhost last message repeated 4 times

So it seems to me that mdrun forced the array to start, and since it began "syncing", it assumed sdc was not failed and used all five drives to rebuild the parity information (sync = rebuild parity; reconstruct = rebuild a drive).
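In hindsight, the gentler route (which I did not actually run, so treat this as a sketch of what I would try next time) is forcing assembly with an explicit member list, which lets mdadm bump the one stale event counter instead of blindly resyncing. One thing that bothers me about the mdrun log above is that it bound whole disks (sdc, sdd, ...) rather than the partitions (sdc1, sdd1, ...) the array was originally built from:

```shell
# Hedged sketch, commented out because these commands rewrite
# superblocks -- this is untested on my array:
#   mdadm --stop /dev/md0
#   mdadm --assemble --force --run /dev/md0 \
#         /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

# Crude sanity check that a member list names partitions rather than
# whole disks (partition nodes like /dev/sdc1 end in a digit;
# /dev/sdc, which mdrun grabbed, does not):
is_partition() {
  case "$1" in *[0-9]) return 0 ;; *) return 1 ;; esac
}

for m in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1; do
  is_partition "$m" || echo "warning: $m looks like a whole disk"
done
```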

During the resync, and even after it finished, I could not access the XFS filesystem. Neither xfs_repair nor xfs_check could find a valid XFS superblock; I let xfs_repair scan the entire device and it could not find a single one. However, piping /dev/md0 through strings does yield some filenames that I recognize from the filesystem.
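One more data point I can gather: XFS superblocks begin with the magic bytes "XFSB" at the start of each allocation group, so scanning the raw device for that magic should show whether the superblocks are truly gone or merely sitting at unexpected offsets (which is roughly what you would expect if the array came back with the wrong device order or a shifted data offset). The sketch below runs against a throwaway file instead of /dev/md0, with the magic planted at an arbitrary made-up offset:

```shell
# Build a 1 MiB stand-in "device": zeros with the XFS magic planted
# at an unaligned offset, the way a shifted array might present it.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=1024 2>/dev/null
printf 'XFSB' | dd of="$img" bs=1 seek=64513 conv=notrunc 2>/dev/null

# grep -abo prints the byte offset of every occurrence of the magic.
# On the real array this would be: grep -abo XFSB /dev/md0 | head
off=$(grep -abo XFSB "$img" | head -n1 | cut -d: -f1)
echo "first XFSB magic at byte offset $off"
rm -f "$img"
```

On an intact filesystem the first hit should be at offset 0, with one hit per allocation group after that; hits only at odd offsets would suggest the data is shifted rather than destroyed.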

So now I've got this array, and I still don't know what malfunctioned. In addition, I have a bad filesystem which I don't want to give up on, because I'd be losing a ton of data. Anyone have any suggestions?

-Adar

PS: I'm not subscribed to debian-user, so please include me in the replies.


