
Re: Failing disk advice



Hello,

On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> >I have a disk that is reporting SMART errors.

What are the errors? Some are more serious, some less so.

> >It is an active disk in a (kernel, not hardware) RAID1
> >configuration. I also have a hot spare in the RAID1, and md
> >hasn't decided it should fail the disk and switch to the hot
> >spare. Should I proactively tell md to fail the disk (and let the
> >hot spare take over), or should I just wait until md notices a
> >problem?
> 
> AFAIK desktop disks and "enterprise RAID" disks degrade differently.
> When a desktop disk is having trouble reading a sector, it will retry
> many times before giving up because it is likely the data does not
> exist anywhere else.  But, an enterprise RAID disc will retry only a
> few times and then fail; because the data should exist elsewhere and
> hung reads are intolerable in enterprise environments.

What you're referring to here is SCT Error Recovery Control:

    https://en.wikipedia.org/wiki/Error_recovery_control

At one point most drives offered this as a configurable timeout,
though it defaulted to disabled on models designed for desktop use.
As you say, the rationale would be that a desktop drive is probably
not in a RAID, so it holds the only copy of the data and must go to
heroic lengths if necessary to read it.

As the drive vendors started being more aggressive about segmenting
their product ranges into "desktop" and "enterprise", they removed
the ability to change the timeout from drives in their desktop ranges.

This has had a very bad side effect for those using desktop drives
in their RAIDs. When SCTERC is not configurable, the drive's internal
timeout is usually longer than Linux's own block layer timeout (30
seconds by default). The drive will be unresponsive for so long that
Linux will think the link has died and reset it or the whole
controller. That can cause multiple drives to be kicked from the MD
array even though there is nothing wrong with them, leading to the
array becoming inoperable.

This is probably the number one cause of "my array broke and won't
assemble again" posts to linux-raid and so the first question asked
is usually, "what are your timeouts set to?"

It is imperative that anyone using MD RAID checks that their drive
timeouts are set sensibly.

You can check a drive's timeout like this:

    # smartctl -l scterc /dev/sda
    smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

    SCT Error Recovery Control:
               Read:     70 (7.0 seconds)
              Write:     70 (7.0 seconds)

If it comes back like this:

    SCT Error Recovery Control:
               Read:     Disabled
              Write:     Disabled

then it means that SCTERC is supported but disabled, so it just needs
setting. The values are in tenths of a second, so 70 means 7.0
seconds:

    # smartctl -q errorsonly -l scterc,70,70 /dev/sda

but if it comes back like:

    Warning: device does not support SCT Error Recovery Control command

then you have a problem as the drive does not support SCTERC and
will likely freeze up for several minutes trying to read a damaged
sector.
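
If you have a lot of drives to check, the three cases above can be
told apart mechanically. This is only a rough sketch: scterc_state is
a name I made up, and it assumes the output looks like the samples
shown above, which may not hold for every smartctl version. It reads
the smartctl output on stdin so you can try it on saved output first.

```shell
# Classify the text printed by "smartctl -l scterc" (read from stdin).
# Prints one of: enabled, disabled, unsupported, unknown.
# Assumes the output format shown in the samples above.
scterc_state() {
    scterc_out=$(cat)
    if printf '%s\n' "$scterc_out" |
            grep -q 'does not support SCT Error Recovery Control'; then
        echo unsupported
    elif printf '%s\n' "$scterc_out" |
            grep -Eq '^ *(Read|Write): +Disabled'; then
        # Reported as disabled if either the read or the write timer is off.
        echo disabled
    elif printf '%s\n' "$scterc_out" |
            grep -Eq '^ *(Read|Write): +[0-9]+ '; then
        echo enabled
    else
        echo unknown
    fi
}
```

You would use it as "smartctl -l scterc /dev/sda | scterc_state" (as
root), and treat "unknown" as a prompt to go and look at the raw
output yourself.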

If you have drives that don't support SCTERC, and you can't replace
them with ones that do, then your next best course of action is to
increase Linux's own timeouts. 180 seconds seems to be enough:

    # echo 180 > /sys/block/sda/device/timeout

The drive will still seem to freeze up for minutes when encountering
an unreadable sector, but Linux will give it longer and you'll avoid
a link/controller reset that could affect other drives.

If you needed to set SCTERC or the Linux drive timeout then you must
re-apply those settings at every boot, as neither is guaranteed to
persist.
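
Something along these lines could be run at boot to do that. Treat it
as a sketch rather than a finished script: set_drive_timeouts,
SYSFS_ROOT and SMARTCTL are names I've invented here (the latter two
only exist so you can point the function at a test directory and a
stub), and I'm assuming smartctl exits non-zero when the drive
rejects the SCTERC command. It only looks at sd* devices.

```shell
# Overridable so the function can be tried against a fake tree first.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}
SMARTCTL=${SMARTCTL:-smartctl}

# For every sd* disk: try to set SCTERC to 7.0 seconds; if that fails,
# raise the Linux command timeout for that disk to 180 seconds instead.
set_drive_timeouts() {
    for sysdev in "$SYSFS_ROOT"/block/sd*; do
        # Skips the unexpanded glob and anything without a timeout file.
        [ -e "$sysdev/device/timeout" ] || continue
        dev=/dev/${sysdev##*/}
        if "$SMARTCTL" -q errorsonly -l scterc,70,70 "$dev" >/dev/null 2>&1
        then
            echo "$dev: SCTERC set to 7.0 seconds"
        else
            echo 180 > "$sysdev/device/timeout"
            echo "$dev: SCTERC not set, Linux timeout raised to 180 seconds"
        fi
    done
}
```

Source that from whatever your distribution runs at boot and call
set_drive_timeouts as root. Drives that accept SCTERC keep the
default Linux timeout; only the ones that don't get the longer one.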

> So, if you are using desktop disks in a RAID, you might need to
> manually intervene to compensate for the mismatch.

Adjusting the timeouts is normally all that would be necessary.

If I had a drive that had SCTERC unsupported and it started showing
signs of impending failure, and I had no hot spare, then I'd
probably get a new drive and replace it ASAP just because of the
hassle involved when it does fail. Chances are that failure is going
to happen at an inconvenient time, whereas I could do the
replacement at a time convenient to me.

If, like OP, I had a hot spare in the array then really it is a
no-brainer to me: promote the hot spare then remove the suspect drive.
Since it's a spare there is no time where the array lacks
redundancy. If you wait for the drive to fail then there will be a
period of no redundancy while the spare is brought in.
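
One way to do that promotion, assuming your mdadm and kernel are new
enough to support --replace, is sketched below. The function name and
the device names in the usage line are made up for illustration. By
default it only prints the command so you can check it; set RUN=1 to
actually execute it.

```shell
# Ask md to copy a suspect member onto a spare while the suspect stays
# active, so the array keeps its redundancy throughout.
# Usage: replace_member ARRAY SUSPECT [SPARE]
# Without SPARE, md is left to pick any available spare.
replace_member() {
    array=$1
    suspect=$2
    spare=${3:-}
    set -- mdadm "$array" --replace "$suspect"
    if [ -n "$spare" ]; then
        set -- "$@" --with "$spare"
    fi
    if [ "${RUN:-0}" = 1 ]; then
        "$@"          # really run it (needs root)
    else
        echo "$@"     # dry run: just show the command
    fi
}
```

For example, "replace_member /dev/md0 /dev/sda1 /dev/sdc1" prints the
mdadm command for a hypothetical array md0 with suspect member sda1
and spare sdc1. As I understand it, once the copy completes md marks
the old member as faulty and it can then be taken out with
"mdadm /dev/md0 --remove /dev/sda1"; watch /proc/mdstat to see when
it has finished.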

This does depend on what kind of SMART failure it is though. Some of
them are a concern but do not imply total device failure in the near
future.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting

