
Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?



On 2024-01-09 at 14:01, Michael Kjörling wrote:

> On 9 Jan 2024 13:25 -0500, from wanderer@fastmail.fm (The Wanderer):
> 
>>>> Within the past few weeks, I got root-mail notifications from 
>>>> smartd that the ATA error count on two of the drives had
>>>> increased - one from 0 to a fairly low value (I think between
>>>> 10 and 20), the other from 0 to 1. I figured this was nothing
>>>> to worry about - because of the relatively low values, because
>>>> the other drives had not shown any such thing, and because of
>>>> the expected stability and lifetime of good-quality SSDs.
>>>> 
>>>> On Sunday (two days ago), I got root-mail notifications from 
>>>> smartd about *all* of the drives in the array. This time, the
>>>> total error counts had gone up to values in the multiple
>>>> hundreds per drive. Since then (yesterday), I've also gotten
>>>> further notification mails about at least one of the drives
>>>> increasing further. So far today I have not gotten any such
>>>> notifications.
>> 
>> Do you read the provided excerpt from the SMART data as indicating
>> that there are hundreds of bad blocks, or that they are rising
>> rapidly?
> 
> No; that was your claim, in the paragraph about Sunday's events.

That paragraph was about the Uncorrectable_Error_Cnt value, which I do
not understand to be a direct count of bad blocks. That's why I wanted
to clarify: if you *do* read that attribute as directly reflecting bad
blocks, I'd like to understand how you arrived at that reading, and if
you were instead drawing that conclusion from other sources, I'd like
to know how and from what, because it would be something I've missed.
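
For cross-checking what smartd mails against the drive's own view, the
raw attribute table can be read directly with something like the
following (the device name is just a placeholder for one of the array
members, and the label on attribute 187 varies by vendor - these drives
report it as Uncorrectable_Error_Cnt, others as Reported_Uncorrect):

    smartctl -A /dev/sdX | grep -Ei '187|uncorrect|runtime_bad_block|pending'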

>> The Runtime_Bad_Block count for that drive is nonzero, but it is
>> only 31.
>> 
>> What's high and seems as if it may be rising is the 
>> Uncorrectable_Error_Cnt value (attribute 187) - which I understand
>> to represent *incidents* in which the drive attempted to read a
>> sector or block and was unable to do so.
> 
> The drive may be performing internal housekeeping and in doing so
> try to read those blocks, or something about your RAID array setup
> may be doing so.
> 
> Exactly what are you using for RAID-6? mdraid? An off-board hardware 
> RAID HBA? Motherboard RAID? Or something else? What you say suggests 
> mdraid or something similar.

mdraid, yes.
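
For what it's worth: if the stock Debian mdadm cron job is in place
here, its monthly checkarray scrub runs on the first Sunday of the
month and reads every sector of every member, which would fit the
Sunday timing of the mass notifications. The array state, and a manual
scrub, can be checked with something like the following (md0 is a
placeholder for the actual array device):

    cat /proc/mdstat
    mdadm --detail /dev/md0
    # a "check" scrub reads every sector; the array should rewrite
    # anything a member fails to return from the remaining redundancy
    echo check > /sys/block/md0/md/sync_action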

>> I've ordered a 22TB external drive for the purpose of creating such
>> a backup. Fingers crossed that things last long enough for it to
>> get here and get the backup created.
> 
> I suggest selecting, installing and configuring (as much as
> possible) whatever software you will use to actually perform the
> backup while you wait for the drive to arrive. It might save you a
> little time later. Opinions differ but I like rsnapshot myself; it's
> really just a front-end for rsync, so the copy is simply files,
> making partial or full restoration easy without any special tools.

My intention was to shut down everything that normally runs, log out as
the user who normally runs it, log in as root (whose home directory,
like the main installed system, is on a different RAID array with
different backing drives), and run rsync from that session. My
understanding is that, in that arrangement, the only thing accessing
the RAID-6 array should be the rsync process itself.
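
Concretely, I expect the copy itself to amount to something on the
order of the following (the mount point for the external drive is a
placeholder, since it hasn't arrived yet):

    rsync -aHAX --info=progress2 /home /opt /mnt/backup/

with -a preserving ownership, permissions and times, -H hard links,
and -A/-X ACLs and extended attributes, so that the result is
restorable as plain files.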

For additional clarity: the RAID-6 array is backing a pair of logical
volumes, which are backing the /home and /opt partitions. The entire
rest of the system is on a series of other logical volumes which are
backed by a RAID-1 array, which is based on entirely different drives
(different model, different form factor, different capacity, I think
even different connection technology) and which has not seen any
warnings arise.
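
(The mapping of logical volumes to the two arrays can be confirmed
with something like the following; the volume group and LV names are
specific to this system, so I've left them out:)

    lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
    lvs -o lv_name,vg_name,devices    # shows which md device backs each LV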

>> dmesg does have what appears to be an error entry for each of the
>> events reported in the alert mails, correlated with the devices in
>> question. I can provide a sample of one of those, if desired.
> 
> As long as the drive is being honest about failures and is reporting 
> failures rapidly, the RAID array can do its work. What you
> absolutely don't want to see is I/O errors relating to the RAID array
> device (for example, with mdraid, /dev/md*), because that would
> presumably mean that the redundancy was insufficient to correct for
> the failure. If that happens, you are falling off a proverbial
> cliff.

Yeah, *that* would be indicative of current catastrophic failure. I have
not seen any messages related to the RAID array itself.
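
A rough filter for keeping an eye on that distinction in the kernel
log, for anyone following along (the pattern is only approximate):

    dmesg | grep -iE 'i/o error|md/raid'
    # errors naming the member devices (sd*/ata*) are what the array
    # exists to absorb; errors naming /dev/md* itself would be the
    # proverbial cliff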


(For awareness: this is all a source of considerable psychological
stress to me, to an extent that is leaving me on the edge of being
physically ill, and I am managing to remain on the good side of that
line only by minimizing my mental engagement with the issue as much as
possible. I am currently able to read and respond to these mails
without crossing that line, but that may change at any moment, and if
it does I will stop replying without notice until things change again.)

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw
