Re: Dying hard drive?

To: debian-user@lists.debian.org
Subject: Re: Dying hard drive?
From: mfidelman <mfidelman@meetinghouse.net>
Date: Fri, 18 Oct 2013 12:40:49 -0400
Message-id: <9d6ooonclrv8qycb80yveg6g.1382114449317@email.android.com>
Reply-to: mfidelman <mfidelman@meetinghouse.net>


-------- Original message --------
From: Veljko <veljko3@gmail.com>
Date: 10/18/2013 12:26 PM (GMT-05:00)
To: debian-user@lists.debian.org
Subject: Re: Dying hard drive?

Hello Miles,

On Fri, Oct 18, 2013 at 11:43:59AM -0400, Miles Fidelman wrote:
> Run smartctl -A /dev/sd[abcd] and look for non-zero raw read error
> and reallocated sector counts. I've found, at least for the WD
> drives I use in my servers, that anything other than a 0 raw read
> error count is a sign of near-term disk failure. The first time I
> encountered the symptoms you report, it took me a LONG time to
> figure it out. The basic SMART test is useless.
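Since smartctl expects a single device per invocation, the check Miles describes is easiest to run in a loop. A minimal sketch, assuming smartmontools is installed, the commands run as root, and the drives are /dev/sda through /dev/sdd; the function names smart_key_attrs and check_drives are hypothetical. The filter prints the ID, name and RAW_VALUE (the 10th whitespace-separated column of smartctl -A output) for attributes 1 and 5.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read `smartctl -A` output on stdin and print
# ID, ATTRIBUTE_NAME and RAW_VALUE (column 10) for attribute 1
# (Raw_Read_Error_Rate) and attribute 5 (Reallocated_Sector_Ct).
# Header and preamble lines never match because their first field is not 1 or 5.
smart_key_attrs() {
  awk '$1 == 1 || $1 == 5 { print $1, $2, $10 }'
}

# Hypothetical wrapper: run the filter once per drive (needs root).
check_drives() {
  local dev
  for dev in /dev/sd[abcd]; do
    echo "$dev:"
    smartctl -A "$dev" | smart_key_attrs
  done
}
```

Call `check_drives` as root, or pipe any saved `smartctl -A` output through `smart_key_attrs`.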

IIUIC, this is the output I should be looking at:

sda:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   095   094   006    Pre-fail  Always       -       230210521
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       5832

sdb:
  1 Raw_Read_Error_Rate     0x000f   119   082   006    Pre-fail  Always       -       234455192
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       2320

sdc:
  1 Raw_Read_Error_Rate     0x000f   115   099   006    Pre-fail  Always       -       87852008
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

sdd:
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       187317944
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

According to this, all drives are bad, but only sda is misbehaving. Three more
attributes are reported as Pre-fail: Spin_Up_Time, Seek_Error_Rate and
Spin_Retry_Count. Full output for sda:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   095   094   006    Pre-fail  Always       -       230231673
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       5832
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       471820319
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9750
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       386
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   063   056   045    Old_age   Always       -       37 (Min/Max 22/44)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1266
194 Temperature_Celsius     0x0022   037   044   000    Old_age   Always       -       37 (0 22 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       174414326932768
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       23166370191361
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       174661697951516
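Note that TYPE describes the kind of attribute, not its current state: a Pre-fail attribute is one whose failure indicates imminent drive failure, and smartctl only treats an attribute as failing when its normalized VALUE has dropped to or below THRESH (WHEN_FAILED then shows FAILING_NOW). A minimal sketch of that comparison, assuming the column layout shown above; `smart_failing` is a hypothetical name, and attributes with a threshold of 000 are skipped because they have no failure threshold.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read `smartctl -A` output on stdin and print every
# attribute whose normalized VALUE (column 4) is <= its THRESH (column 6).
# Only rows whose first field is a numeric attribute ID are considered,
# and a THRESH of 000 is ignored.
smart_failing() {
  awk '$1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
         print $1, $2, "VALUE=" $4, "THRESH=" $6, $7
       }'
}
```

Fed the sda table above, it prints nothing: no normalized value has reached its threshold, even though raw counters such as Reported_Uncorrect (386) and Current_Pending_Sector (8) are non-zero. That fits Miles's remark that the basic SMART test is useless for this kind of failure.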

> Note that this particularly applies if you're not using an
> enterprise-class drive. Standard drives try very hard to read from
> the medium, and take a long time before they give up. Enterprise
> drives assume they're part of a RAID array and just give up,
> throwing an error.

I'm using Seagate ST3000DM001-9YN166 drives, which are not enterprise-class.
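The give-up-quickly behavior Miles attributes to enterprise drives is commonly implemented as SCT Error Recovery Control (ERC). smartctl can show the current setting with `-l scterc` and request new limits with `-l scterc,READTIME,WRITETIME`, both given in units of 100 ms; desktop models may not support the command at all, in which case smartctl reports that. The sketch below assumes bash, and `scterc_arg` is a hypothetical helper that only converts whole seconds into that argument; the 7-second limit in the usage comments is an illustrative value, not something recommended in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the "scterc,READ,WRITE" argument for smartctl -l.
# smartctl expects both timeouts in deciseconds (units of 100 ms),
# so whole seconds are multiplied by 10.
scterc_arg() {
  local read_s=$1 write_s=$2
  echo "scterc,$(( read_s * 10 )),$(( write_s * 10 ))"
}

# Usage (as root; support varies by drive, desktop models may refuse):
#   smartctl -l scterc /dev/sda                  # show current ERC settings
#   smartctl -l "$(scterc_arg 7 7)" /dev/sda     # request a 7 s read/write limit
```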

OK... Seagate probably encodes error information in that field rather than a simple count. Try searching for the model number together with "SMART" or "Raw_Read_Error_Rate"; that might help you track things down. You could also try soft-removing one drive at a time from your array to isolate the bad drive.
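One widely repeated, but not officially documented, interpretation supports that guess: on many Seagate drives the 48-bit raw value of Raw_Read_Error_Rate (1) and Seek_Error_Rate (7) is said to hold an error count in the upper 16 bits and an operation count in the lower 32 bits. A minimal sketch under that assumption; `seagate_split_raw` is a hypothetical helper, and the layout itself should be verified for this model before drawing conclusions.

```bash
#!/usr/bin/env bash
# ASSUMPTION (unofficial, community-reported layout): upper 16 bits of the
# 48-bit raw value = error count, lower 32 bits = operation count.
# seagate_split_raw is a hypothetical helper name.
seagate_split_raw() {
  local raw=$1
  # Bash arithmetic is 64-bit on common platforms, so a 48-bit value fits.
  echo "errors=$(( raw >> 32 )) operations=$(( raw & 0xFFFFFFFF ))"
}

# Decode the Raw_Read_Error_Rate values reported for the four drives.
for raw in 230210521 234455192 87852008 187317944; do
  seagate_split_raw "$raw"
done
```

Under that assumption every value in this thread decodes to zero errors, because each is below 2^32 (4294967296); the large numbers would then be operation counts rather than a sign of failure. Treat this as a lead to check, not a diagnosis.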
