Why didn't software RAID detect a faulty drive?
So I normally use hardware array controllers, but I have software RAID1
on one server. Since I know we have some pro software RAID people on
here, hopefully you can help me out. ;)
I got paged that HTTPS was flapping on a server, tried to log in,
horrible responsiveness, etc. I fired up the serial console and saw this
repeating in cycles with minor variations:
[3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[3948800.949314] ata2.00: BMDMA stat 0x4
[3948800.929508] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag 0 dma 8192 out
[3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
[3948801.007509] ata2.00: status: { DRDY ERR }
[3948801.020017] ata2.00: error: { ICRC ABRT }
[3948801.032537] ata2: soft resetting link
[3948801.212298] ata2.00: configured for UDMA/33
[3948801.225345] ata2: EH complete
It's running software RAID, so why is it locking up? I managed to log in
as root and cat /proc/mdstat:
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
78148096 blocks [2/2] [UU]
unused devices: <none>
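For scripting this kind of check, the status brackets are the thing to parse: an underscore in place of a U means a mirror half is down. Here's a minimal sketch of such a health check, with sample /proc/mdstat content inlined (on the real box you'd read /proc/mdstat itself):

```shell
# Sample mdstat content; on a live system: mdstat=$(cat /proc/mdstat)
mdstat='md0 : active raid1 sda1[0] sdb1[1]
      78148096 blocks [2/2] [UU]'

# An underscore inside the brackets ([U_] or [_U]) means a failed
# or missing member; [UU] means both mirror halves are active.
if printf '%s\n' "$mdstat" | grep -q '\[U*_U*\]'; then
    echo degraded
else
    echo healthy
fi
```

Which, as you can see above, reported the array as perfectly healthy while the box was thrashing.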
Huh, I thought to myself, the stupid thing didn't notice. So I tried to
fail the disk manually, since md hadn't figured it out on its own:
# mdadm /dev/md0 --fail /dev/sdb1
But that didn't help either:
[3948838.514699] raid1: Disk failure on sdb1, disabling device.
[3948838.514702] raid1: Operation continuing on 1 devices.
[3948846.397781] RAID1 conf printout:
[3948846.409726] --- wd:1 rd:2
[3948846.418353] disk 0, wo:0, o:1, dev:sda1
[3948846.430623] disk 1, wo:1, o:0, dev:sdb1
[3948846.452002] md: recovery of RAID array md0
[3948846.464006] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[3948846.482338] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.511468] md: using 128k window, over a total of 78148096 blocks.
[3948846.530732] md: resuming recovery of md0 from checkpoint.
[3948846.547400] md: md0: recovery done.
[3948846.568079] RAID1 conf printout:
[3948846.576383] --- wd:1 rd:2
[3948846.585012] disk 0, wo:0, o:1, dev:sda1
[3948846.597284] disk 1, wo:1, o:0, dev:sdb1
[3948846.617747] md: recovery of RAID array md0
[3948846.630524] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[3948846.648489] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.678992] md: using 128k window, over a total of 78148096 blocks.
[3948846.696152] md: resuming recovery of md0 from checkpoint.
[3948846.715443] md: md0: recovery done.
This kept repeating until I pulled the plug. Luckily the array
remembered that sdb1 should stay faulted when it came back up.
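For anyone curious how often that exception cycle fired, it's easy to count from a captured log. A sketch, with a few of the sample lines inlined (on the real machine I'd grep /var/log/kern.log or the `dmesg` output, assuming the messages made it to disk at all):

```shell
# Sample captured kernel log; on a live system:
#   grep -c 'ICRC' /var/log/kern.log
log='[3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[3948801.020017] ata2.00: error: { ICRC ABRT }
[3948801.032537] ata2: soft resetting link
[3948801.225345] ata2: EH complete'

# Count the ICRC error lines, one per exception cycle.
printf '%s\n' "$log" | grep -c 'ICRC'
```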
It's running Debian stable (lenny). What happened here?
~Seth