
Re: excessive CPU usage



On 29. sep. 2014 09:32, Julien boooo wrote:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     35888         307316
# 2  Short offline       Completed: read failure       90%     35887         330254
# 3  Extended offline    Completed: read failure       90%     35887         410646
Sorry for the verbiage, but you might have the clues you need to start reassigning sectors here, though I have on occasion seen smartctl report the LBA erroneously. You will find out soon enough if you do:
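----
$ hdparm --read-sector 307316 /dev/sdX
$ hdparm --read-sector 330254 /dev/sdX
$ hdparm --read-sector 410646 /dev/sdX
---
(Run these as root, and replace /dev/sdX with the drive that failed the self-tests.)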
If you get errors from the above commands, you need to reassign those sectors. If not, smartctl may be reporting the wrong LBAs, because the drive was unable to store the correct sector number of the error in its internal log. Your numbers look plausible, though: they are not all one "highest possible integer" value, which is what you would most likely see if they were wrong.
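If the self-test log is long, the LBAs can be pulled out of it instead of being copied by hand. The sketch below assumes the log layout quoted above, where LBA_of_first_error is the last column and rows without an error show "-". The function names `selftest_lbas` and `print_read_checks` and the device `/dev/sdX` are hypothetical placeholders, not part of smartctl or hdparm. It only prints the hdparm commands, so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct LBA_of_first_error value from
# smartctl self-test log text read on stdin, in order of first appearance.
selftest_lbas() {
  awk '
    # Data rows start with "#" followed by the test number.
    /^# *[0-9]+/ {
      lba = $NF                       # LBA_of_first_error is the last column
      if (lba ~ /^[0-9]+$/ && !(lba in seen)) {
        seen[lba] = 1                 # skip "-" entries and repeated LBAs
        print lba
      }
    }'
}

# Print (do not run) one hdparm read check per LBA.
# $1 is the device, e.g. /dev/sdX (a placeholder, not a default).
print_read_checks() {
  device=$1
  selftest_lbas | while read -r lba; do
    echo "hdparm --read-sector $lba $device"
  done
}
```

Typical use would be something like `smartctl -l selftest /dev/sdX | print_read_checks /dev/sdX`, then running the printed lines as root.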

If smartctl is in error, you need to find the errors in your system logs from the time they happened, i.e. look for the bad sectors somewhere like /var/log/syslog (or is it /var/log/messages? I forget). grep for 'SAT' or 'ATA' in your logs.
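A rough way to do that search, as a sketch only: the function name `find_disk_errors` is hypothetical, and the match patterns are my assumptions, since the kernel's wording for disk errors differs between versions and drivers. It searches the files you pass it, or /var/log/syslog and /var/log/messages if you pass none, and silently skips any file that does not exist or is not readable.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that look like ATA/SATA disk errors.
# Patterns are assumptions; adjust them to match your kernel's messages.
find_disk_errors() {
  # Default to the usual log locations when no files are given.
  if [ "$#" -eq 0 ]; then
    set -- /var/log/syslog /var/log/messages
  fi
  for f in "$@"; do
    [ -r "$f" ] || continue          # skip missing or unreadable files
    # Case-insensitive: "ata1.00: ...", "SATA link ...", "sector 307316"
    grep -E -i 'ata[0-9]+|sata|sector [0-9]+' "$f"
  done
}
```

The LBAs in lines mentioning "sector NNN" are the ones to feed to the hdparm read checks.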

It may also be that you have "lucked out", and the sectors have been written to, and thus reassigned automatically. This will make the next read from that sector succeed, if the drive is not totally beyond repair.
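If a sector does fail the read check and your backup is done, you can force that write yourself. hdparm's `--write-sector` overwrites the sector with zeros, destroying whatever it held, and refuses to run without `--yes-i-know-what-i-am-doing`. The helper below is a hypothetical dry-run wrapper (`print_sector_rewrite` is not a real tool, and `/dev/sdX` is a placeholder) that only prints the command, so you can double-check the LBA and device before running it by hand.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do NOT run) the hdparm command that
# overwrites one sector with zeros, prompting the drive to reallocate it.
# Any data stored in that sector is lost.
print_sector_rewrite() {
  lba=$1
  device=$2                          # placeholder such as /dev/sdX
  # Reject anything that is not a plain decimal sector number.
  case $lba in
    ''|*[!0-9]*) echo "not a decimal LBA: $lba" >&2; return 1 ;;
  esac
  echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $device"
}
```

Afterwards, `smartctl -A /dev/sdX` should show whether Reallocated_Sector_Ct or Current_Pending_Sector changed, and the file that owned the sector may now contain zeros where the bad data was.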

And, as I said at first, this is merely a stop-gap while your drive gets progressively worse and %wa in "top" goes up (you never told us how much I/O wait you have).

So your plan should be:

1) Back up everything
2) Order a new drive
3) Muddle through while you wait for your replacement.

You should consider ordering TWO drives and running them as a mirror. Then you can set the drives' error-recovery timeout to 7 seconds and avoid such bad performance the next time a drive starts failing. DO NOT set that timeout if you only have one drive, or the chance of data loss will increase.
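The post does not name the tool, but I assume the timeout meant here is the drive's SCT Error Recovery Control, which `smartctl -l scterc,READ,WRITE` sets in units of 100 ms (so 7 seconds is 70). The sketch below only prints the command; `erc_command` is a hypothetical name and `/dev/sdX` a placeholder. Not every drive supports SCT ERC, and on many drives the setting is lost at power-off, so it usually has to be reapplied at boot.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the smartctl command that would set the
# SCT Error Recovery Control read and write timeouts to a whole number of
# seconds. smartctl expects units of 100 ms, so 7 s becomes 70.
erc_command() {
  seconds=$1
  device=$2                          # placeholder such as /dev/sdX
  ds=$((seconds * 10))               # convert seconds to deciseconds
  echo "smartctl -l scterc,$ds,$ds $device"
}
```

Running `smartctl -l scterc /dev/sdX` without values shows the current setting, or reports that the drive does not support it.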

Remember, if your drive is under warranty, a replacement is free.


