
Re: can use of smartctl prevent disk probs?



Maurits van Rees wrote:
On Mon, Oct 31, 2005 at 02:12:32PM -0600, Hugo Vanwoerkom wrote:

Since failures with /dev/hdc I have been paying attention to smartctl.

It is a, how should one say it, interesting program that provides lots
of data.

But it is not clear to me how you can prevent disk failure other than
buying a disk with 5 year warranty, or something.


That also won't prevent disk failure.  It will just get you your money
back or get you a new hard disk.  If the disk fails, your data is still
lost.  A warranty is no replacement for a backup regime.  But yes, I
would sooner trust and buy a hard disk with a five-year warranty than
one with a one-year warranty. :)

Smartctl just gives you advance warning when your hard disk is at risk
of failure.  When you receive such a signal then make a backup and get
ready to insert a new hard disk.

I use logcheck so I get an hourly email about events on my system.
There are some messages from smartd (the daemon that comes with
smartctl) that I don't worry about, so I let logcheck ignore them.
Here are the relevant lines that I have in
/etc/logcheck/ignore.d.workstation/local.  They probably won't work
as-is on your system, so adapt them as you see fit.

^\w{3} [ :0-9]{11} [._[:alnum:]-]+ smartd\[[0-9]+\]: Device: /dev/hdb, SMART Usage Attribute: 3 Spin_Up_Time changed from ([89][0-9]|1[0-9]{2}) to ([89][0-9]|1[0-9]{2})$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ smartd\[[0-9]+\]: Device: /dev/hdc, SMART Usage Attribute: 194 Temperature_Celsius changed from [0-9]{3} to [0-9]{3}$

These lines have been here for months, so for my system these messages
are no reason to run to the computer shop. ;) But I do make daily
backups of course.
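
Before relying on ignore rules like the two above, it can help to try
them against a few log lines.  The following is only a rough sketch in
Python, not part of logcheck itself.  Two things in it are assumptions:
Python's re module has no POSIX classes, so [._[:alnum:]-] is rewritten
as [._A-Za-z0-9-], and the sample lines (host name, PID, values) are
made up for illustration.

```python
import re

# Common prefix of both rules: month, timestamp, host name, smartd[PID].
# Assumption: [._[:alnum:]-] is rewritten as [._A-Za-z0-9-] because
# Python's re module does not understand POSIX character classes.
PREFIX = r"^\w{3} [ :0-9]{11} [._A-Za-z0-9-]+ smartd\[[0-9]+\]: "

IGNORE_RULES = [
    re.compile(
        PREFIX
        + r"Device: /dev/hdb, SMART Usage Attribute: 3 Spin_Up_Time "
        r"changed from ([89][0-9]|1[0-9]{2}) to ([89][0-9]|1[0-9]{2})$"
    ),
    re.compile(
        PREFIX
        + r"Device: /dev/hdc, SMART Usage Attribute: 194 "
        r"Temperature_Celsius changed from [0-9]{3} to [0-9]{3}$"
    ),
]


def is_ignored(line):
    """Return True if any ignore rule matches the log line."""
    return any(rule.search(line) for rule in IGNORE_RULES)


# Hypothetical sample lines; only the message format comes from the rules.
SAMPLES = [
    "Oct 31 14:12:32 myhost smartd[1234]: Device: /dev/hdc, "
    "SMART Usage Attribute: 194 Temperature_Celsius changed from 112 to 113",
    "Nov  1 09:05:01 myhost smartd[1234]: Device: /dev/hdb, "
    "SMART Usage Attribute: 3 Spin_Up_Time changed from 95 to 101",
    # Spin_Up_Time dropping to 70 is outside the 80-199 range, so this
    # line should NOT be ignored and would still be reported.
    "Nov  1 09:35:01 myhost smartd[1234]: Device: /dev/hdb, "
    "SMART Usage Attribute: 3 Spin_Up_Time changed from 95 to 70",
]

if __name__ == "__main__":
    for line in SAMPLES:
        status = "ignored " if is_ignored(line) else "REPORTED"
        print(status, line)
```

Note that the Spin_Up_Time rule only ignores changes while both values
stay between 80 and 199, so a drop outside that range still ends up in
the logcheck mail.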


That's helpful, thanks.
I see a few more attributes besides "Spin Up Time":
"Seek Time Performance" and "Run Out Cancel".

The funny thing is that they are marked pre-fail on a disk that I use now, an "old" Maxtor (Power Cycle Ct = 704).

But the disk that I no longer trust is only 6 months old (Power Cycle Ct = 204) and shows none of those; the only thing it reports is UDMA CRC Error Ct = 92. It failed with I/O errors while I was running from it.

Is there a good site to find out what those names actually mean?

H










