Re: Software RAID (was [OT] 19"/2U Cases)

> Reminds me of the Google report and how disk errors correlated with 
> SMART reporting errors: http://labs.google.com/papers/disk_failures.pdf

This paper is definitely an interesting read[1].  What Google hoped to do
was build a model that could predict which drives were likely to fail.
What they said about SMART in particular was:

"Out of all failed drives, over 56% of them have no count in any of the
four strong SMART signals, namely scan errors, reallocation count, offline
reallocation, and probational count. In other words, models based only on
those signals can never predict more than half of the failed drives. Figure
14 shows that even when we add all remaining SMART parameters (except
temperature) we still find that over 36% of all failed drives had zero
counts on all variables."

So basically, given their massive amount of data, they say that the only
SMART errors that correlate strongly enough with drive failures to be
interesting are scan errors, reallocation count, offline reallocation, and
probational count.  But fewer than 44% of the drives that failed in their
study actually had errors in any of those categories.
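
The arithmetic behind that 44% figure can be sketched in a few lines of
Python. The two fractions come straight from the passage quoted above; the
variable and function names are mine, and because the paper says "over 56%"
and "over 36%", the results are upper bounds rather than exact values.

```python
# Fractions taken from the passage quoted from the paper.
no_strong_signal = 0.56  # failed drives with zero counts in the four strong signals
no_signal_at_all = 0.36  # failed drives with zero counts in all SMART parameters (except temperature)


def max_detectable(fraction_without_signal):
    """Upper bound on the share of failures a purely signal-based predictor could flag."""
    return 1.0 - fraction_without_signal


strong_bound = max_detectable(no_strong_signal)
all_bound = max_detectable(no_signal_at_all)

print(f"strong signals only:  at most {strong_bound:.0%} of failed drives")
print(f"all SMART parameters: at most {all_bound:.0%} of failed drives")
```

So even a predictor that used every SMART parameter would miss more than a
third of the failures.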

On the other hand, I think their data and charts show that if you DO have
SMART errors in these categories, the probability of drive failure is
significantly increased.

My conclusion from Google's conclusions :-) is that just because SMART says
everything is OK, it might not be.  But if you have errors in any of those
four SMART categories, you should consider buying a new drive ASAP.

Take care,

[1] Of course, they left out the most interesting data: which drive models
were the most and least reliable!
Dale E. Martin - dale@the-martins.org