
Re: Why didn't software RAID detect a faulty drive?

On Mon, July 20, 2009 8:09 pm, martin f krafft wrote:
> also sprach Seth Mattinen <sethm@rollernet.us> [2009.07.19.2052 +0200]:
>> My guess is that the kernel was masking it. A hardware array
>> controller will see it directly since it's not relying on
>> intermediate layers and kick it out of the array. There's
>> absolutely no reason to keep a slow to respond drive in an array
>> even if it's not throwing errors. This is one situation where
>> a hardware array has a distinct advantage.
> I agree, and I have forwarded your issue to upstream. However, for
> better results, I suggest you take your issue to
> linux-raid@vger.kernel.org (and CC me). Upstream will prefer to
> respond there.

1/ Why didn't md fail the drive?

 'md' doesn't have a sophisticated model of drives.  It just submits
 requests and waits for replies.  Each reply reports either success or
 failure.  There is no concept of how long a request should take, so no
 way to decide a request took "too long".  Such decisions really need
 to be made lower down, in the driver for the particular device.
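
 As a rough illustration of the point above (a toy model, not md code;
 the names Reply and should_fail_member are made up for this sketch):
 if the only input is success or failure, a drive that answers
 correctly but very slowly never looks faulty.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    ok: bool          # success or failure: the only field the policy reads
    seconds: float    # latency: recorded here, but never consulted


def should_fail_member(replies):
    """Fail the member only if some reply reports an error.

    Latency is deliberately ignored, mirroring the description above:
    there is no notion of a request taking "too long".
    """
    return any(not r.ok for r in replies)


# Hypothetical traces: one drive is slow but correct, the other is
# fast but returns a single error.
slow_but_ok = [Reply(ok=True, seconds=30.0) for _ in range(5)]
one_error = [Reply(ok=True, seconds=0.01), Reply(ok=False, seconds=0.01)]
```

 With these traces, should_fail_member(slow_but_ok) is False and
 should_fail_member(one_error) is True, which matches the slow drive
 staying in the array.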

 There is a flag that can be added to a request: FAILFAST.
 If it is set, no error recovery is attempted.  If it is not set
 lots of error recovery is attempted.  Neither of these is ideal.  We
 really want "try some error recovery, but don't try too hard" ... which
 isn't really well defined.  But in any case there is no way to ask
 for something like that so md doesn't.
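
 To make the gap concrete, here is a minimal sketch of the three
 behaviours as retry budgets.  This is an assumption-laden toy, not a
 kernel interface: the BOUNDED policy does not exist, and the budget
 numbers, Recovery, submit and flaky are all invented for illustration.

```python
from enum import Enum


class Recovery(Enum):
    FAILFAST = 0   # flag set: no error recovery is attempted
    FULL = 1       # flag not set: lots of error recovery is attempted
    BOUNDED = 2    # hypothetical: "some recovery, but don't try too hard"


# Retry budgets are made-up numbers, chosen only to show the contrast.
RETRY_BUDGET = {Recovery.FAILFAST: 0, Recovery.FULL: 100, Recovery.BOUNDED: 3}


def submit(attempt, policy):
    """Call attempt() until it succeeds or the retry budget is spent.

    Returns (ok, attempts_made).
    """
    tries = 0
    for _ in range(1 + RETRY_BUDGET[policy]):
        tries += 1
        if attempt():
            return True, tries
    return False, tries


def flaky(fail_first_n):
    """Build an attempt function that fails its first fail_first_n calls."""
    state = {"calls": 0}

    def attempt():
        state["calls"] += 1
        return state["calls"] > fail_first_n

    return attempt
```

 A device that needs two retries fails under FAILFAST after one attempt
 and succeeds under BOUNDED on the third.  A device that needs ten
 retries is given up on quickly under BOUNDED, while FULL keeps going
 until it finally succeeds.  What the right budget would be is exactly
 the part that isn't well defined.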

 Yes, this is a shortcoming.  We really need someone who understands
 the various failure modes of devices (i.e. not me) to design the
 right sort of interface.

2/ Why did md re-add the device?
 This is a mystery....
 When a device fails, the md core code tries to disconnect it from the
 array (which is a different thing from "mdadm --remove").  If there are
 any outstanding IO requests this will fail, to be retried later.
 I suspect that when it tries, there is still some other IO slowly
 retrying so the disconnect doesn't work.  Then somehow the faulty drive
 is thought to be a spare and is rebuilt.
 That shouldn't happen and I haven't convinced myself from looking
 at the code that it can happen, so I could be wrong.  But it's my
 best bet at the moment.... I'll try to look a bit more some time.
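
 The "fails, to be retried later" step can be sketched as follows.
 This is a toy model under stated assumptions, not the md code: Member,
 try_disconnect and the device name are hypothetical, and the sketch
 does not model the suspected (and unconfirmed) step where the faulty
 drive is mistaken for a spare.

```python
class Member:
    """Toy stand-in for one device in an array."""

    def __init__(self, name):
        self.name = name
        self.faulty = False
        self.pending_io = 0    # outstanding IO requests on this device
        self.attached = True   # still connected to the array


def try_disconnect(member):
    """Detach a faulty member, but only if no IO is outstanding.

    Returns True if the member was detached, False if the caller
    should retry later.
    """
    if not member.faulty or member.pending_io > 0:
        return False
    member.attached = False
    return True


# Hypothetical scenario: the drive is marked faulty while one slow
# request is still being retried lower down.
drive = Member("sdb")
drive.faulty = True
drive.pending_io = 1
first_try = try_disconnect(drive)    # deferred: IO still outstanding
drive.pending_io = 0
second_try = try_disconnect(drive)   # succeeds once the IO has drained
```

 In this scenario the first attempt is deferred and the second one
 detaches the device.  Between those two attempts the device is faulty
 but still attached, which is the window discussed above.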

