
Re: impending disk failure?



See below

Tony van der Hoff wrote:
On 17/10/15 17:47, Miles Fidelman wrote:
Dominique Dumont wrote:
On Saturday 17 October 2015 14:15:52 Tony van der Hoff wrote:
Can anyone please explain what it means, and whether I should be
worried?
You should check the drive with smartctl.

See http://www.smartmontools.org/

HTH

Yes.. and be sure to go beyond the basic tests.

First off, make sure it's running:
smartctl -s on -A /dev/disk0   ; for each drive, using the
appropriate /dev/..

Then, after it's accumulated some stats:
smartctl -A /dev/disk0

For a lot of drives, the first line (raw read errors) can be very
telling: anything other than 0, and your disk is failing.
Start-up time can also be telling, if it's increasing.
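
A minimal sketch of how the attribute table printed by `smartctl -A`
could be read programmatically, so the values above can be checked
without eyeballing the output. The SAMPLE text is invented for
illustration, the function names are hypothetical, and the parser
assumes the usual ten-column ATA attribute layout with numeric
VALUE/WORST/THRESH columns.

```python
# Invented sample of `smartctl -A` output; the numbers are illustrative only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       142461368
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   053   045   000    Old_age   Always       -       142461368
"""


def parse_smart_attributes(text):
    """Return {attribute_name: fields} for each row of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID and have ten columns;
        # the header and any other lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9].strip()
        first = raw.split()[0]
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),    # normalized value
            "worst": int(fields[4]),
            "thresh": int(fields[5]),   # assumed numeric (see lead-in)
            "raw": raw,                 # raw text, may carry extra annotations
            "raw_int": int(first) if first.isdigit() else None,
        }
    return attrs


def below_threshold(attrs):
    """Names of attributes whose normalized value is at or below a non-zero threshold."""
    return [name for name, a in attrs.items()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]
```

For real use, the text would come from running smartctl itself (for
example via subprocess) rather than from SAMPLE. Raw values are
vendor-specific, so the parsed numbers still need interpreting per
drive model.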

The thing is that most drives, except those designed for use in RAID
arrays, mask impending disk failures by re-reading blocks multiple
times. They often get the data eventually, but your machine keeps
getting slower and slower.



Thanks Miles, and tomás, for your helpful replies.

I apologise for the delay in replying, but I've been away from my desk a few days.

I have, however, been doing some extensive googling, and it would appear that the raw read error count (RREC) is something of a red herring, especially when applied to Seagate drives, as these are. Both my drives have quite high RREC values (in the millions), numbers which are precisely matched by the Hardware ECC Recovered counts, suggesting that the RREC is merely an artifact of HDDs being essentially mechanical devices pushed to their limits using clever technology. The SMART extended tests reveal no problems.

The Wikipedia entry https://en.wikipedia.org/wiki/S.M.A.R.T. is particularly informative on the relative importance of these error counts; the RREC can be safely ignored, as somebody else here recently suggested.

You're missing the point.

As Wikipedia also points out
(<https://en.wikipedia.org/wiki/S.M.A.R.T.#cite_note-seagate1-2>):
"Mechanical failures account for about 60% of all drive failures." and "Further, 36% of drives failed without recording any S.M.A.R.T. error at all, except the temperature, meaning that S.M.A.R.T. data alone was of limited usefulness in anticipating failures."

Today's disk drives are designed to PROTECT DATA, AND MAINTAIN ACCESS TO DATA, until the very moment before the drive fails catastrophically. The "Hardware ECC Recovered Count" indicates that:
- there are likely to be problems with the underlying media that the ECC is recovering from, and these will only get worse over time
- the recovery takes time, which is why your system is slowing down
- the more underlying errors there are, the more time it takes to recover
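
Whether a counter is "getting worse over time" can be measured rather
than guessed at by recording its raw value periodically and looking at
the growth rate. A rough sketch follows; the function name and the
readings are made up for illustration, and it assumes the counter only
ever increases (some vendors' raw values wrap or pack several fields,
in which case this arithmetic does not apply).

```python
from datetime import datetime


def growth_per_day(snapshots):
    """Average increase per day between the first and last reading.

    snapshots: list of (datetime, raw_count) pairs, oldest first.
    """
    (t0, c0), (t1, c1) = snapshots[0], snapshots[-1]
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("snapshots must span a positive amount of time")
    return (c1 - c0) / days


# Invented readings of one raw counter, taken a week apart.
readings = [
    (datetime(2015, 10, 17), 1_000_000),
    (datetime(2015, 10, 24), 1_700_000),
]
rate = growth_per_day(readings)  # 700,000 over 7 days
```

A steady rate says little by itself; a rate that keeps climbing
between successive weeks is the pattern worth acting on.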

I've never found SMART extended tests to be indicative of anything, until a disk is nearly dead. Though http://www.z-a-recovery.com/manual/smart.aspx gives a good list of other SMART variables that might indicate mechanical failures.

If your drives are a couple of years old, and your machine is getting slower, don't engage in wishful thinking - back up and get new drives.

Miles

--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

