
Why didn't software RAID detect a faulty drive?



So I normally use hardware array controllers, but I have software RAID1 on one server. Since I know we have some software RAID pros on here, hopefully one of you can help me out. ;)

I got paged that HTTPS was flapping on a server and tried to log in, but responsiveness was horrible. I fired up the serial console and saw this repeating in cycles, with minor variations:

[3948800.929508] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[3948800.949314] ata2.00: BMDMA stat 0x4
[3948800.960273] ata2.00: cmd ca/00:10:6f:02:22/00:00:00:00:00/e0 tag 0 dma 8192 out
[3948800.960276]          res 51/84:0a:75:02:22/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
[3948801.007509] ata2.00: status: { DRDY ERR }
[3948801.020017] ata2.00: error: { ICRC ABRT }
[3948801.032537] ata2: soft resetting link
[3948801.212298] ata2.00: configured for UDMA/33
[3948801.225345] ata2: EH complete
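
(Those ICRC/ABRT errors make me want to ask the drive itself what it thinks. Once I get a stable shell my plan is to check SMART, assuming smartmontools is on the box, something like:

# smartctl -H /dev/sdb
# smartctl -A /dev/sdb

with the UDMA CRC error count attribute being the interesting one if this is a cable/link problem rather than the platters. I didn't capture that before pulling the plug, so no output to paste yet.)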

It's running software RAID, so why is it locking up? I managed to log in as root and cat /proc/mdstat:

Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      78148096 blocks [2/2] [UU]
unused devices: <none>
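
If it helps, I can also paste the longer status view, which should show the array State, the Failed Devices count, and the per-member state:

# mdadm --detail /dev/md0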

Huh, I think to myself, the stupid thing didn't work. So I try to fail the member manually, since it didn't figure it out on its own:

# mdadm /dev/md0 --fail /dev/sdb1

But that didn't help either:

[3948838.514699] raid1: Disk failure on sdb1, disabling device.
[3948838.514702] raid1: Operation continuing on 1 devices.
[3948846.397781] RAID1 conf printout:
[3948846.409726]  --- wd:1 rd:2
[3948846.418353]  disk 0, wo:0, o:1, dev:sda1
[3948846.430623]  disk 1, wo:1, o:0, dev:sdb1
[3948846.452002] md: recovery of RAID array md0
[3948846.464006] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[3948846.482338] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.511468] md: using 128k window, over a total of 78148096 blocks.
[3948846.530732] md: resuming recovery of md0 from checkpoint.
[3948846.547400] md: md0: recovery done.
[3948846.568079] RAID1 conf printout:
[3948846.576383]  --- wd:1 rd:2
[3948846.585012]  disk 0, wo:0, o:1, dev:sda1
[3948846.597284]  disk 1, wo:1, o:0, dev:sdb1
[3948846.617747] md: recovery of RAID array md0
[3948846.630524] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[3948846.648489] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[3948846.678992] md: using 128k window, over a total of 78148096 blocks.
[3948846.696152] md: resuming recovery of md0 from checkpoint.
[3948846.715443] md: md0: recovery done.

This kept repeating until I pulled the plug. Luckily it remembered it should stay faulted when it came back up.
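
Assuming the drive (or its cable) really is bad, my plan is the usual replace-and-resync dance, roughly as follows, assuming sdb comes back under the same device name:

# mdadm /dev/md0 --remove /dev/sdb1
  (swap the drive or reseat the cable; if it's a new disk, copy the
   partition table over first, e.g. sfdisk -d /dev/sda | sfdisk /dev/sdb)
# mdadm /dev/md0 --add /dev/sdb1

and then watch /proc/mdstat for the resync.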

It's running stable/lenny. What happened here?
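
A related question: would mdadm --monitor ever have mailed me about this? As I understand it, the monitor only reports events that md itself raises (Fail, DegradedArray, and so on), so if md never saw an I/O error there was nothing to report. For reference, the sort of invocation I mean:

# mdadm --monitor --scan --daemonise --mail=root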

~Seth

