
Why didn't software RAID detect a faulty drive?



So I normally use hardware array controllers, but I have software RAID1 on one server. Since I know we have some software RAID pros on here, hopefully one of you can help me out. ;)

I got paged that HTTPS was flapping on a server and tried to log in, but responsiveness was horrible. I fired up the serial console and saw this repeating in cycles, with minor variations:

[3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[3948800.949314] ata2.00: BMDMA stat 0x4
[3948800.960273] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag 0 dma 8192 out
[3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
[3948801.007509] ata2.00: status: { DRDY ERR }
[3948801.020017] ata2.00: error: { ICRC ABRT }
[3948801.032537] ata2: soft resetting link
[3948801.212298] ata2.00: configured for UDMA/33
[3948801.225345] ata2: EH complete
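
(Those ICRC/ABRT errors make me want to ask the drive itself what it thinks. Once I get a stable shell my plan is to check SMART, assuming smartmontools is on the box, something like:

# smartctl -H /dev/sdb
# smartctl -A /dev/sdb

with the UDMA CRC error count attribute being the interesting one if this is a cable/link problem rather than the platters. I didn't capture that before pulling the plug, so no output to paste yet.)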

It's running software RAID, so why is it locking up? I managed to log in as root and cat /proc/mdstat:

Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      78148096 blocks [2/2] [UU]
unused devices: <none>
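
If it helps, I can also paste the longer status view, which should show the array State, the Failed Devices count, and the per-member state:

# mdadm --detail /dev/md0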

Huh, I think to myself, the stupid thing didn't work. So I try to fail the member manually, since it didn't figure it out on its own:

# mdadm /dev/md0 --fail /dev/sdb1

But that didn't help either:

[3948838.514699] raid1: Disk failure on sdb1, disabling device.
[3948838.514702] raid1: Operation continuing on 1 devices.
[3948846.397781] RAID1 conf printout:
[3948846.409726]  --- wd:1 rd:2
[3948846.418353]  disk 0, wo:0, o:1, dev:sda1
[3948846.430623]  disk 1, wo:1, o:0, dev:sdb1
[3948846.452002] md: recovery of RAID array md0
[3948846.464006] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[3948846.482338] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.511468] md: using 128k window, over a total of 78148096 blocks.
[3948846.530732] md: resuming recovery of md0 from checkpoint.
[3948846.547400] md: md0: recovery done.
[3948846.568079] RAID1 conf printout:
[3948846.576383]  --- wd:1 rd:2
[3948846.585012]  disk 0, wo:0, o:1, dev:sda1
[3948846.597284]  disk 1, wo:1, o:0, dev:sdb1
[3948846.617747] md: recovery of RAID array md0
[3948846.630524] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[3948846.648489] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.678992] md: using 128k window, over a total of 78148096 blocks.
[3948846.696152] md: resuming recovery of md0 from checkpoint.
[3948846.715443] md: md0: recovery done.

This kept repeating until I pulled the plug. Luckily it remembered it should stay faulted when it came back up.
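
Assuming the drive (or its cable) really is bad, my plan is the usual replace-and-resync dance, roughly as follows, assuming sdb comes back under the same device name:

# mdadm /dev/md0 --remove /dev/sdb1
  (swap the drive or reseat the cable; if it's a new disk, copy the
   partition table over first, e.g. sfdisk -d /dev/sda | sfdisk /dev/sdb)
# mdadm /dev/md0 --add /dev/sdb1

and then watch /proc/mdstat for the resync.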

It's running stable/lenny. What happened here?
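
A related question: would mdadm --monitor ever have mailed me about this? As I understand it, the monitor only reports events that md itself raises (Fail, DegradedArray, and so on), so if md never saw an I/O error there was nothing to report. For reference, the sort of invocation I mean:

# mdadm --monitor --scan --daemonise --mail=root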

~Seth

