Re: Why didn't software RAID detect a faulty drive?
On Mon, July 20, 2009 8:09 pm, martin f krafft wrote:
> also sprach Seth Mattinen <firstname.lastname@example.org> [2009.07.19.2052 +0200]:
>> My guess is that the kernel was masking it. A hardware array
>> controller will see it directly since it's not relying on
>> intermediate layers and kick it out of the array. There's
>> absolutely no reason to keep a slow to respond drive in an array
>> even if it's not throwing errors. This is one situation where
>> a hardware array has a distinct advantage.
> I agree, and I have forwarded your issue to upstream. However, for
> better results, I suggest you take your issue to
> email@example.com (and CC me). Upstream will prefer to
> respond there.
1/ Why didn't md fail the drive?
'md' doesn't have a sophisticated model of drives. It just submits
requests and waits for replies. Each reply reports either success or failure.
There is no concept of how long a request should take, so no way to
decide a request took "too long". Such decisions really need to be
made lower down in the driver for the particular device.
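As a rough illustration, here is a toy Python model of that point. It is
not kernel code: Reply, md_like_view, driver_like_view and the 30-second
deadline are all made up for this sketch. The md-like caller only looks at
success or failure, so a drive that answers correctly but very slowly still
looks healthy; only a layer that knows about deadlines could object.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    ok: bool        # the only field the md-like caller inspects
    seconds: float  # simulated service time (nothing actually sleeps)


def md_like_view(reply):
    """Caller with no notion of 'too long': only success/failure matters."""
    return "ok" if reply.ok else "fail drive"


def driver_like_view(reply, deadline_s):
    """Hypothetical lower layer that also applies a per-request deadline."""
    if not reply.ok or reply.seconds > deadline_s:
        return "fail drive"
    return "ok"


slow_but_successful = Reply(ok=True, seconds=45.0)
print(md_like_view(slow_but_successful))                       # slow drive passes
print(driver_like_view(slow_but_successful, deadline_s=30.0))  # deadline trips
```

The deadline value is arbitrary; picking a sensible one is exactly the kind
of device-specific knowledge that belongs in the driver.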
There is a flag that can be added to a request: FAILFAST.
If it is set, no error recovery is attempted. If it is not set
lots of error recovery is attempted. Neither of these is ideal. We
really want "try some error recovery, but don't try too hard" ... which
isn't really well defined. But in any case there is no way to ask
for something like that so md doesn't.
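To picture the gap between the two existing behaviours, here is a small
Python sketch. It is only a model: submit, flaky_device and the retry
counts are invented for illustration, and the "bounded" budget is the
hypothetical middle ground that cannot currently be requested.

```python
def submit(attempt_io, max_retries):
    """Try the request once, then retry up to max_retries times on error.

    Returns (succeeded, attempts_made).
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        if attempt_io():
            return True, attempts
    return False, attempts


def flaky_device(fail_first_n):
    """Simulated device whose first fail_first_n attempts return an error."""
    state = {"calls": 0}

    def attempt_io():
        state["calls"] += 1
        return state["calls"] > fail_first_n

    return attempt_io


FAILFAST_RETRIES = 0   # flag set: no error recovery at all
FULL_RETRIES = 100     # flag clear: "lots" of recovery (number is arbitrary)
BOUNDED_RETRIES = 3    # hypothetical "some recovery, but not too hard"

for name, budget in [("failfast", FAILFAST_RETRIES),
                     ("full", FULL_RETRIES),
                     ("bounded", BOUNDED_RETRIES)]:
    # A device that fails 10 times in a row before answering.
    print(name, submit(flaky_device(fail_first_n=10), budget))
```

With a device that needs many attempts, the first policy gives up at once,
the second keeps going for a long time, and only the hypothetical third one
gives up after a modest effort. What "modest" should mean is the part that
is not well defined.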
Yes, this is a shortcoming. We really need someone who understands
the various failure modes of devices (i.e. not me) to design the
right sort of interface.
2/ Why did md re-add the device?
This is a mystery....
When a device fails, the md core code tries to disconnect it from the
array (which is a different thing from "mdadm --remove"). If there are
any outstanding IO requests this will fail, to be retried later.
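The disconnect-and-retry behaviour just described can be pictured with a
toy Python model. Member, try_disconnect and complete_one are names made up
for this sketch and do not correspond to the real md data structures; the
only thing modelled is "disconnect is refused while I/O is outstanding".

```python
class Member:
    """Toy stand-in for an array member; not the kernel's data structure."""

    def __init__(self, pending):
        self.pending = pending  # outstanding I/O requests
        self.faulty = True      # already marked as failed
        self.attached = True    # still connected to the array


def try_disconnect(member):
    """Disconnect only if nothing is in flight; otherwise report busy."""
    if member.pending > 0:
        return False            # caller has to retry later
    member.attached = False
    return True


def complete_one(member):
    """Simulate one outstanding request finishing or giving up."""
    member.pending = max(0, member.pending - 1)


drive = Member(pending=2)
failed_attempts = 0
while not try_disconnect(drive):
    failed_attempts += 1
    complete_one(drive)         # slow retries eventually drain
print(failed_attempts, drive.attached)
```

While requests are still slowly retrying, the member stays attached even
though it is already marked faulty, which is the window of interest here.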
I suspect that when it tries, there is still some other IO slowly
retrying so the disconnect doesn't work. Then somehow the faulty drive
is thought to be a spare and is rebuilt.
That shouldn't happen and I haven't convinced myself from looking
at the code that it can happen, so I could be wrong. But it's my
best bet at the moment.... I'll try to look a bit more some time.