
Re: is this hard disk failure?



On Tue, 07 Jun 2011, Miles Fidelman wrote:
> b. you're running RAID - instead of the drive dropping out of the
> array, the entire array slows down as it waits for the failing drive
> to (eventually) respond

Eh, it is worse.

A failing drive _will_ drop out of the array sooner or later, and it can
be very bad if it does so 'sooner' for any reason other than an
imminent unit failure:  there is a high probability of other device(s)
also timing out while the array is degraded or rebuilding, which
results in service downtime (and usually data loss).

You never want disks dropping off the array due to performance problems
that are not signs of imminent failure: the chance of multiple drops
causing an array failure is too high.  You want to know the disk is
slow, and to replace it under controlled conditions.

This problem is *common*.  Don't do hardware RAID on regular consumer
crap without SCT ERC support (aka TLER/CCTL/ERC), and don't buy
expensive crap with buggy firmware for which the vendor, to save face,
refuses to issue a public fix (but which you can get from your RAID card
vendor if you are very lucky).  On Linux, smartctl gives you access to
the drive's SCT ERC settings if the drive supports them.
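
As a rough illustration of the smartctl access mentioned above, here is a
minimal Python sketch that builds the `smartctl -l scterc` command line and
parses its report.  The helper names (scterc_command, parse_scterc,
query_scterc) are made up for this example, and the "Read:/Write:" output
layout the parser expects is an assumption based on what smartmontools
typically prints; check it against your own version.

```python
import re
import subprocess

def scterc_command(device, read_ds=None, write_ds=None):
    """Build a smartctl argv that reads SCT ERC, or sets it if both values are given.

    Values are in deciseconds (units of 100 ms): 70 means 7.0 seconds.
    """
    if read_ds is None and write_ds is None:
        return ["smartctl", "-l", "scterc", device]
    if read_ds is None or write_ds is None:
        raise ValueError("give both read and write timeouts, or neither")
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

# Assumed report layout: lines such as "Read:     70 (7.0 seconds)"
# or "Write: Disabled".
_ERC_LINE = re.compile(r"^\s*(Read|Write):\s+(\d+|Disabled)\b", re.MULTILINE)

def parse_scterc(output):
    """Return {'read': ds, 'write': ds} (None meaning disabled).

    Returns None when no Read/Write lines are present, for example when the
    drive does not support SCT ERC.
    """
    found = {}
    for name, value in _ERC_LINE.findall(output):
        found[name.lower()] = None if value == "Disabled" else int(value)
    if "read" in found and "write" in found:
        return found
    return None

def query_scterc(device):
    """Run smartctl (usually needs root) and parse whatever it reports."""
    result = subprocess.run(scterc_command(device),
                            capture_output=True, text=True, check=False)
    return parse_scterc(result.stdout)
```

Something like query_scterc("/dev/sda") would then tell you whether the
drive has ERC enabled, disabled, or missing before you put it behind a RAID
controller.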

Also, any device model (not a SPECIFIC device) for which firmware
updates are available that reduce the effective throughput should be
avoided like the plague, as that indicates they have shipped models with
manufacturing or component issues, and you can never be sure of what
you'll get when you buy a new one.

If you have already bought such a device, one with a known high ratio of
design or manufacturing defects/weaknesses, it depends on your luck
whether you got something good or a lemon.  If SMART finds *NO* issues
(no increase in high-fly writes, no growth in reallocated sectors), and
throughput tests show the expected response, you have a good one: be
happy.
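
The SMART half of that check can be sketched as a comparison of two
`smartctl -A` snapshots taken some time apart.  This is only an
illustration: the function names are hypothetical, the attribute names
(Reallocated_Sector_Ct, High_Fly_Writes) are the ones smartmontools commonly
prints but vary by vendor, and the parser assumes the usual ten-column
attribute table with a plain integer in the RAW_VALUE column.

```python
WATCHED = ("Reallocated_Sector_Ct", "High_Fly_Writes")

def parse_attributes(output):
    """Extract {attribute_name: raw_value} from `smartctl -A` table text.

    Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE.  Rows whose raw value does not start with a
    plain integer are skipped.
    """
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

def growing_attributes(previous, current, watched=WATCHED):
    """Return {name: (old, new)} for watched raw counters that increased."""
    grown = {}
    for name in watched:
        if name in previous and name in current:
            if current[name] > previous[name]:
                grown[name] = (previous[name], current[name])
    return grown
```

An empty result from growing_attributes() is the "no issues" case; anything
else is the cue to pull the drive.  The throughput test is a separate
measurement and is not covered here.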

If either test shows any such issues, remove it from production.  Then
either secure-erase it and apply any firmware updates, if you want to
use it as throw-away backup media (make sure the data is encrypted), or
send it for recycling.

Linux software RAID is much more forgiving by default (and the timeout
can be tuned for each component device separately): most of the time it
will just slow down instead of kicking component devices off the array
until data loss happens.  That could be useful if you got duped by the
vendor and were sold a defective drive that can only operate safely
out-of-spec, but can still be useful to you.
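
For the per-device timeout, one knob on Linux is the block layer's command
timeout exposed in sysfs at /sys/block/<dev>/device/timeout (in seconds).
Whether that is the only timeout worth tuning is not something this sketch
settles; it just shows reading and writing that one file.  The function
names and the sysfs_root parameter (there so the code can be tried against a
scratch directory) are inventions for this example.

```python
from pathlib import Path

def timeout_path(device, sysfs_root="/sys"):
    """Map '/dev/sda' (or 'sda') to its sysfs command-timeout file."""
    name = Path(device).name  # "/dev/sda" -> "sda"
    return Path(sysfs_root) / "block" / name / "device" / "timeout"

def read_timeout(device, sysfs_root="/sys"):
    """Return the current command timeout for the device, in seconds."""
    return int(timeout_path(device, sysfs_root).read_text().strip())

def set_timeout(device, seconds, sysfs_root="/sys"):
    """Write a new command timeout in seconds (needs root on a real system).

    The setting is not persistent across reboots.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(device, sysfs_root).write_text(f"{int(seconds)}\n")
```

For example, set_timeout("/dev/sdb", 120) would give one slow component
more time to answer than its siblings; the 120 is an arbitrary
illustration, not a recommendation.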

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

