Re: Why didn't software RAID detect a faulty drive?
Miles Fidelman wrote:
> Seth Mattinen wrote:
>> The system was horribly unresponsive; I never did try adding the drive
>> back in because it was a live server and I didn't want to risk it. I
>> would have expected any RAID to fault an unresponsive drive even if it
>> was a quirk. I just replaced it.
> Two things I learned recently, the hard way, when I had a RAID drive fail:
> 1. Drives can fail in ways that can get masked for a long time, in
> particular - increasing numbers of disk reads or writes that eventually
> succeed - after lots of retries. The symptom is that things slow down
> to a crawl. Not sure why the md software doesn't simply fail drives
> that exhibit long delays, but it doesn't seem to (ideas anyone?).
My guess is that the kernel was masking it. A hardware array controller
sees the drive directly, without any intermediate layers in the way, and
will kick it out of the array. There's absolutely no reason to keep a
slow-to-respond drive in an array even if it's not throwing errors. This
is one situation where a hardware array has a distinct advantage.
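For the archives: md won't fail a member on latency alone, but once you've spotted the slow disk you can eject it by hand, and you can shorten the kernel's command timeout so a hanging drive surfaces errors instead of stalling the box. A rough sketch only; the device names are stand-ins for your own array, and I haven't verified the timeout tweak helps in every failure mode:

```shell
# Shorten the SCSI command timeout for the suspect disk so hung I/O
# errors out quickly instead of retrying for ages (default is 30s on
# most distros; /dev/sdb is a placeholder for the slow drive).
echo 7 > /sys/block/sdb/device/timeout

# Manually mark the slow member failed, pull it from the array, and
# add the replacement -- md starts rebuilding onto the new disk.
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1
mdadm /dev/md0 --add /dev/sdc1
```

Obviously do the --fail during a quiet window if you can; the rebuild itself will hammer the remaining disks.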
> 2. If all of your drives are the same age - it would be a very good idea
> to replace the OTHER drives in your RAID array before they start
> failing. In my case, I had a server with four drives (2 RAID1 sets).
> As I was recovering from one drive failure, two of the others failed in
> rapid succession. Not very pretty at all.
I had two different brands in the array. ;) One Seagate, one Western
Digital. The WD (recertified, bleh) was the culprit.
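Since both points come down to drives failing quietly, one habit that would have caught mine earlier: watch the SMART counters that climb before a drive falls over (pending/reallocated sectors), plus power-on hours for the same-age-drives problem. A rough sketch of pulling those out of `smartctl -A` output; the sample lines below are made up for illustration, on a real box you'd pipe in `smartctl -A /dev/sdb`:

```shell
# Sample smartctl -A attribute lines (fabricated values for the demo);
# columns are ID, name, flags, value/worst/thresh, type, ..., raw value.
smart_sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21347
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

# Pull the raw value (last field) for the attributes worth alarming on.
echo "$smart_sample" | awk '
  $2 == "Reallocated_Sector_Ct"  { print "reallocated=" $NF }
  $2 == "Current_Pending_Sector" { print "pending=" $NF }
  $2 == "Power_On_Hours"         { print "hours=" $NF }'
```

Nonzero pending/reallocated counts on an md member are a good cue to fail it proactively rather than wait for it to drag the array down.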