Re: Failing disk advice



In my experience, you will likely get a few more weeks or months of life out of the drive, but it will die.
Mirko's suggestion of migrating to an n=3 RAID1 setup is also what I would recommend.

You will notice in your smartctl output that Reallocated_Sector_Ct is 41. That means 41 sectors have already been remapped to the drive's spare sectors. The 8 Offline_Uncorrectable / Current_Pending_Sector entries are probably unpopulated sectors that triggered an I/O error the last time data was read from them and have not been rewritten since. For me this often happened on a swap partition, since swap sees a lot of transient writes. The next time the system writes to those sectors, the write will either fail, in which case the drive marks the sector as permanently unusable and remaps it, or succeed and clear the pending count.
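
If you want to keep an eye on those three counters while the rebuild runs, something like the following Python sketch would do. It is only a sketch: the function name, the regular expression and the SAMPLE text are my own inventions, the sample rows are copied from the smartctl -A output quoted in this thread, and it assumes the ten-column attribute layout shown there.

```python
import re

# Attribute IDs discussed above: reallocated, pending, offline-uncorrectable.
WATCHED_IDS = {5, 197, 198}

# One attribute row of `smartctl -A` (assumed layout):
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+"   # ID, name, flag
    r"\s+\d+\s+\d+\s+\d+"                   # value, worst, thresh
    r"\s+\S+\s+\S+\s+\S+"                   # type, updated, when_failed
    r"\s+(\d+)"                             # leading integer of the raw value
)


def watched_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    found = {}
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if match and int(match.group(1)) in WATCHED_IDS:
            found[match.group(2)] = int(match.group(3))
    return found


# Rows copied from the output quoted in this thread (hypothetical test input).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
  9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE).items():
        print(name, raw)
```

In practice you would feed it the real output of smartctl -A for the drive (for example captured with subprocess.run) instead of a pasted sample, and compare the numbers from one day to the next. A rising reallocated count is the one I would worry about most.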

Good luck
Patrick

On Mon, Mar 6, 2017 at 8:27 PM, Gregory Seidman <gsslist+debian@anthropohedron.net> wrote:
On Mon, Mar 06, 2017 at 12:17:03PM +0100, Mirko Parthey wrote:
> On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> > On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> > >I have a disk that is reporting SMART errors. It is an active disk in
> > >a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> > >in the RAID1, and md hasn't decided it should fail the disk and switch
> > >to the hot spare. Should I proactively tell md to fail the disk (and
> > >let the hot spare take over), or should I just wait until md notices a
> > >problem?
> >
> > I'm confused by "I also have a hot spare in the RAID1".  Do you have a
> > two-member RAID1 with a hot spare, or a three-member RAID1?  I would
> > prefer the latter:
> >
> > https://manpages.debian.org/jessie/mdadm/md.4.en.html
>
> Refining this advice a bit, I would convert the spare to a full RAID
> member now, without explicitly failing the disk that reports SMART
> errors first.
> Assuming you have a two-member RAID1 with a hot spare, the command
> should be similar to this (untested):
>   mdadm -G /dev/mdX -n 3
> This ensures you keep redundancy during further maintenance actions.

I was unaware that this was possible. I've run it and mdadm -D reports that
it is now in the "clean, degraded, rebuilding" state. Thank you! I wish I
had room in my system to add the fourth (which I've ordered) without
removing the failing disk, but I do not.

> Which SMART errors do you get, and who reports them?

I get emails sent to root:

        This message was generated by the smartd daemon running on:

           host name:  XXXXXX
           DNS domain: YYYYYY

        The following warning/error was logged by the smartd daemon:

        Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors

        Device info:
        ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

        For details see host's SYSLOG.

        You can also use the smartctl utility for further investigation.
        The original message about this issue was sent at Wed Dec 14 00:51:36 2016 EST
        Another message will be sent in 24 hours if the problem persists.

...and...

        This message was generated by the smartd daemon running on:

           host name:  XXXXXX
           DNS domain: YYYYYY

        The following warning/error was logged by the smartd daemon:

        Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors

        Device info:
        ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

        For details see host's SYSLOG.

        You can also use the smartctl utility for further investigation.
        The original message about this issue was sent at Wed Dec 14 00:51:37 2016 EST
        Another message will be sent in 24 hours if the problem persists.

(Yes, I know, I've been letting it do this since mid-December, which is not
great.)

> What is the output of the following command for the failing drive?
>   smartctl -A /dev/sdY

        # smartctl -A /dev/sdc
        smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
        Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

        === START OF READ SMART DATA SECTION ===
        SMART Attributes Data Structure revision number: 10
        Vendor Specific SMART Attributes with Thresholds:
        ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
          1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       205161943
          3 Spin_Up_Time            0x0003   100   091   000    Pre-fail  Always       -       0
          4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1055
          5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
          7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1743842168
          9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
         10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
         12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
        184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
        187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
        188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       133146017827
        189 High_Fly_Writes         0x003a   007   007   000    Old_age   Always       -       93
        190 Airflow_Temperature_Cel 0x0022   060   040   045    Old_age   Always   In_the_past 40 (Min/Max 26/45 #502)
        194 Temperature_Celsius     0x0022   040   060   000    Old_age   Always       -       40 (0 18 0 0 0)
        195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       205161943
        197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
        198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
        199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
        240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53897 (15 186 0)
        241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       917595486
        242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1262569510

> Regards,
> Mirko

Thanks for the help so far,
--Greg


