
Re: is this hard disk failure?



Ralf Mardorf wrote:
For me a hard disc never gets broken without click-click-click noise
before it failed, but it's very common that cables and connections fail.


By the time a disk gets to the click-click-click phase, there has been LOTS of warning - it's just that today's disks include lots of internal fault-recovery mechanisms that hide things from you, unless you run SMART diagnostics (and not just the basic "smart status" either).

For example, if you have a machine that's suddenly running VERY slowly - it's a good sign that a drive is experiencing internal read errors (unless it's a laptop - then a shorted battery is a good suspect). Both are lessons learned the hard way, and not forgotten.

Turns out that modern drives have onboard processors that retry reads multiple times - good for protecting data if you only have the one copy on that drive, at the expense of slower disk access. Not so good if:

a. you don't notice that it's happening (the disk will eventually fail hard), or,

b. you're running RAID - instead of the drive dropping out of the array, the entire array slows down as it waits for the failing drive to (eventually) respond

In either case, you'll tear your hair out trying to figure out why your machine is running slowly (is it a virus, a file lock that didn't release, etc., etc., etc.).
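One rough way to spot the retry-induced stalls described above, before reaching for SMART, is to time raw reads and look for outliers. The following is only a sketch under stated assumptions: the function names, the 1 MiB chunk size and the "10 times the median" threshold are illustrative choices, not anything from a standard tool, and the OS page cache can mask slow reads on data that was read recently.

```python
import time


def time_reads(path, chunk_size=1024 * 1024, max_chunks=64):
    """Read `path` in fixed-size chunks; return the seconds each read took."""
    timings = []
    # buffering=0 turns off Python's own buffering. The OS page cache may
    # still serve the data, so results are most meaningful on cold data.
    with open(path, "rb", buffering=0) as f:
        for _ in range(max_chunks):
            start = time.perf_counter()
            data = f.read(chunk_size)
            elapsed = time.perf_counter() - start
            if not data:  # end of file
                break
            timings.append(elapsed)
    return timings


def slow_chunks(timings, factor=10.0):
    """Return indexes of reads that took more than `factor` x the median."""
    if not timings:
        return []
    ordered = sorted(timings)
    median = ordered[len(ordered) // 2]
    return [i for i, t in enumerate(timings) if t > factor * median]
```

Pointing time_reads() at a large file on the suspect drive (or, as root, at the device node itself) and passing the result to slow_chunks() gives a list of chunk positions that stalled. A handful of reads that take far longer than the rest is consistent with a drive retrying internally, though it does not prove it.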

Lessons learned:

- if your machine is running really slowly, try a reboot -- if it reboots properly, but takes twice as long (or longer) to shut down and then come back up -- get very suspicious (if your patience lasts that long)

- if it's a laptop - pull the battery and try again - if everything is normal, buy yourself a new battery

- if it's a server - try booting from a liveCD (if you can, first disconnect the hard drive entirely) - if normal then you could well have a hard drive problem (or you could have a virus)

- install SMART utilities and run "smartctl -A /dev/<your drive>" -- the first line is usually the "raw read error" rate -- if the value (last entry on the line) is anything except 0, that's a sign that your drive is failing; if it's in the 1000s, failure is imminent. It's just that your drive's internal software is hiding it from you - replace it!

- if you're running RAID, be sure to purchase "enterprise" drives ("desktop" drives try very hard to read a sector, despite the delay; enterprise drives give up quickly, as they expect failure recovery to be handled by RAID)

- you would expect software raid (md) to detect slow drives, mark them bad, and drop them from an array -- nope, md does not keep track of delay

and, not really relevant for Debian, but a direct offshoot of learning the above lessons:

- if you're running a Mac or Windows, your system may be reporting "smart status good" - but it's not really true - it's not looking at raw read errors

- there seems to be a bug in the smart utilities for Mac (as available through Macports and Fink) -- the smart daemon will fail periodically, with the only symptom being that every few minutes, your machine will slow to a crawl (spinning beachball everywhere) for 30 seconds or so, then recover --- a really good example of taking a pre-emptive measure that causes a new problem (I can't tell you how long it took to track this one down - what with downloading every performance tracking tool I could find.)
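To make the "smartctl -A" check above less of an eyeball exercise, the attribute table can be parsed and the raw read error value pulled out. This is a minimal sketch, not a definitive tool: the sample output is made up for illustration, the function names are hypothetical, and the 0 / 1000 cut-offs are simply the rule of thumb from this message. Be aware that some drive families pack other information into this raw value and report large numbers even when healthy, so treat the verdict as a prompt to look closer rather than a diagnosis.

```python
# Illustrative sample in the column layout "smartctl -A" prints;
# the numbers are invented, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1342
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   110   105   000    Old_age   Always       -       33 (Min/Max 20/45)
"""


def parse_smart_attributes(text):
    """Map attribute name -> raw value (int) from a smartctl -A table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # this skips the header and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Raw values may carry extras such as "33 (Min/Max 20/45)";
        # keep only the leading integer.
        first = raw.split()[0]
        if first.isdigit():
            attrs[name] = int(first)
    return attrs


def judge_read_errors(raw_value):
    """Apply the rule of thumb from this message to a raw read error value."""
    if raw_value == 0:
        return "ok"
    if raw_value >= 1000:
        return "replace now"
    return "suspect"


attrs = parse_smart_attributes(SAMPLE)
verdict = judge_read_errors(attrs["Raw_Read_Error_Rate"])
```

In real use the text would come from running smartctl (for example via subprocess.run with capture_output=True) rather than from SAMPLE. Looking the attribute up by name, instead of trusting that it is the first line, avoids misreading drives that order their attributes differently.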


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In<fnord>  practice, there is.   .... Yogi Berra


