Re: Failing disk advice

On Mon, Mar 6, 2017 at 8:27 PM, Gregory Seidman <gsslist+debian@anthropohedron.net> wrote:

On Mon, Mar 06, 2017 at 12:17:03PM +0100, Mirko Parthey wrote:
> On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> > On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> > >I have a disk that is reporting SMART errors. It is an active disk in
> > >a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> > >in the RAID1, and md hasn't decided it should fail the disk and switch
> > >to the hot spare. Should I proactively tell md to fail the disk (and
> > >let the hot spare take over), or should I just wait until md notices a
> > >problem?
> >
> > I'm confused by "I also have a hot spare in the RAID1". Do you have a
> > two-member RAID1 with a hot spare, or a three-member RAID1? I would
> > prefer the latter:
> >
> > https://manpages.debian.org/jessie/mdadm/md.4.en.html
>
> Refining this advice a bit, I would convert the spare to a full RAID
> member now, without explicitly failing the disk that reports SMART
> errors first.
> Assuming you have a two-member RAID1 with a hot spare, the command
> should be similar to this (untested):
> mdadm -G /dev/mdX -n 3
> This ensures you keep redundancy during further maintenance actions.

I was unaware that this was possible. I've run it and mdadm -D reports that
it is now in the "clean, degraded, rebuilding" state. Thank you! I wish I
had room in my system to add the fourth (which I've ordered) without
removing the failing disk, but I do not.
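
For anyone scripting this, here is a minimal sketch of pulling the array state out of mdadm -D output, for example to poll until the rebuild has finished. The function name md_state and the device /dev/md0 are placeholders of mine, not anything from this thread, and the sketch assumes mdadm -D prints a single line of the form "State : ...".

```shell
#!/bin/sh
# md_state: read `mdadm -D` output on stdin and print the array state.
# Assumption: the output contains a line of the form "State : <value>",
# possibly with leading indentation and trailing whitespace.
md_state() {
    awk -F' : ' '/^[[:space:]]*State :/ {
        sub(/[[:space:]]+$/, "", $2)   # drop trailing whitespace
        print $2
        exit
    }'
}

# Example use (needs root; /dev/md0 is a placeholder for the real array):
#   mdadm -D /dev/md0 | md_state
```

Matching on "State :" at the start of the line is meant to skip the "RaidDevice State" column header in the device table further down the same output.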

> Which SMART errors do you get, and who reports them?

I get emails sent to root:

This message was generated by the smartd daemon running on:

host name: XXXXXX
DNS domain: YYYYYY

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors

Device info:
ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Dec 14 00:51:36 2016 EST
Another message will be sent in 24 hours if the problem persists.

...and...

This message was generated by the smartd daemon running on:

host name: XXXXXX
DNS domain: YYYYYY

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors

Device info:
ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Dec 14 00:51:37 2016 EST
Another message will be sent in 24 hours if the problem persists.

(Yes, I know, I've been letting it do this since mid-December, which is not
great.)

> What is the output of the following command for the failing drive?
> smartctl -A /dev/sdY

# smartctl -A /dev/sdc
smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       205161943
  3 Spin_Up_Time            0x0003   100   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1055
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
  7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1743842168
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       133146017827
189 High_Fly_Writes         0x003a   007   007   000    Old_age   Always       -       93
190 Airflow_Temperature_Cel 0x0022   060   040   045    Old_age   Always   In_the_past 40 (Min/Max 26/45 #502)
194 Temperature_Celsius     0x0022   040   060   000    Old_age   Always       -       40 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       205161943
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53897 (15 186 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       917595486
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1262569510
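
To pick the sector-related counters out of a listing like the one above, a small filter along these lines might help. It is only a sketch: the function name smart_flags is hypothetical, the choice of attribute IDs (5, 187, 197, 198) is my own selection of counters that are commonly watched rather than anything stated in this thread, and it assumes the ten-column layout shown above with RAW_VALUE in the tenth column.

```shell
#!/bin/sh
# smart_flags: read `smartctl -A` output on stdin and print "NAME RAW"
# for selected attributes whose raw value is non-zero.
# Assumptions: IDs 5, 187, 197 and 198 are an illustrative selection,
# and the raw value is the tenth whitespace-separated field.
smart_flags() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Example use (needs root; /dev/sdY is a placeholder for the real drive):
#   smartctl -A /dev/sdY | smart_flags
```

Fed the listing above, this would print Reallocated_Sector_Ct 41, Reported_Uncorrect 3, Current_Pending_Sector 8 and Offline_Uncorrectable 8, one per line.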

> Regards,
> Mirko

Thanks for the help so far,
--Greg