Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?

To: debian-user@lists.debian.org
Subject: Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?
From: David Christensen <dpchrist@holgerdanske.com>
Date: Tue, 9 Jan 2024 14:34:39 -0800
Message-id: <7e9a7ec8-6d39-465d-bd29-e8795da7fe77@holgerdanske.com>
In-reply-to: <659D45EC.8020801@fastmail.fm>
References: <659D45EC.8020801@fastmail.fm>

On 1/9/24 05:11, The Wanderer wrote:

> I have an eight-drive RAID-6 array of 2TB SSDs, built
> back in early-to-mid 2021.
>
> Within the past few weeks, I got root-mail notifications from smartd
> that the ATA error count on two of the drives had increased ...
>
> On Sunday (two days ago), I got root-mail notifications from smartd
> about *all* of the drives in the array.
>
> One thing I don't know, which may or may not be important, is whether
> these alert mails are being triggered when the error-count increase
> happens, or when a scheduled check of some type is run.

Please do a full backup to a portable HDD ASAP. Put that HDD off-site. Get another HDD and do a full backup. Then do incremental backups daily. After a week, two weeks, or a month, swap the drives. At some point, start destroying older backups to make room for new backups.
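
A minimal sketch of the full-then-incremental scheme, assuming GNU tar is available. The function names, archive names, and the snapshot file name are made up for illustration; any backup tool with incremental support would do the same job.

```shell
# Sketch: full + incremental backups with GNU tar (assumed to be installed).
# backup_full / backup_incr and the file names below are hypothetical.

# Level-0 (full) backup of directory $1 into backup directory $2.
# A fresh snapshot file records the state of everything that was saved.
backup_full() {
    src=$1 dest=$2
    rm -f "$dest/snapshot.snar"
    tar --create --file="$dest/full.tar" \
        --listed-incremental="$dest/snapshot.snar" \
        --directory="$(dirname "$src")" "$(basename "$src")"
}

# Incremental backup: saves only what changed since the snapshot file
# was last updated, then updates the snapshot file again.
backup_incr() {
    src=$1 dest=$2
    tar --create --file="$dest/incr-$(date +%Y%m%d-%H%M%S).tar" \
        --listed-incremental="$dest/snapshot.snar" \
        --directory="$(dirname "$src")" "$(basename "$src")"
}
```

Usage would be something like `backup_full /home /mnt/portable` once, then `backup_incr /home /mnt/portable` daily (both paths are examples). Because the same snapshot file is reused, each incremental is relative to the previous run, so a restore needs the full archive plus every incremental in order.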

Please burn your most critical data to high-quality optical media. Enable checksums by some means (extended attributes, checksum file in volume root, etc.). Validate the burn using the checksums. Then burn and validate new critical data every week, two weeks, month, etc. Validate checksums periodically.
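
One way to do the "checksum file in volume root" variant, as a sketch assuming GNU coreutils and findutils. The function names and the SHA256SUMS file name are my own choices, not anything standard for optical media.

```shell
# Sketch: checksum file in the volume root, written before the burn and
# checked after it. Assumes GNU sha256sum, find, sort, and xargs.

# Write SHA256SUMS at the top of directory $1, covering every file below it.
make_checksums() {
    ( cd "$1" &&
      find . -type f ! -name SHA256SUMS -print0 |
      sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

# Re-read every file under $1 and compare against SHA256SUMS.
# Exit status is non-zero if any file is missing or differs.
verify_checksums() {
    ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

Run `make_checksums` on the staging directory before burning, then `verify_checksums` on the mounted disc (for example `verify_checksums /media/cdrom`, mount point assumed) right after the burn and again at each periodic check.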

AIUI smartd runs periodically via systemd. Perhaps another reader can post the incantation required to display the settings and/or locate past SMART reports on disk.

You can always run smartctl manually to get a SMART report whenever you want (I like the --xall/-x option):


# smartctl -x DEV

> I've looked at the SMART attributes for the drives, and am having a hard
> time determining whether or not there's anything worth being actually
> concerned about here. Some of the information I'm seeing seems to
> suggest yes, but other information seems to suggest no.

Reading SMART reports has a learning curve. STFW for the terms you do not understand. And beware that different manufacturers with different engineers make different long-term predictions based upon different short-term test data.

Looking at SMART reports over time for the same drive, looking for trends, and noticing problems is exactly the right thing to do. You and smartd did good. :-)

> Most of the attributes are listed as of type "Old_age".

Samsung 870 EVO are good drives, but they are "consumer" drives -- e.g. intended for laptop/desktop computers that are powered off or hibernating most of the time. The SMART report you attached showed a "Power_On_Hours" attribute value of 22286. Assuming an operational specification of 40 hours/week (about 2,080 hours/year), that SSD has usage equivalent to 10.7 years. So, it is old.
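
The arithmetic behind that figure, as a small sketch. The 40 hours/week duty cycle is the assumption from the paragraph above; the 24x7 line is added only for comparison, and the function name is made up.

```shell
# Convert a SMART Power_On_Hours raw value into years under two duty cycles.
# 40 h/week is the assumed "consumer" duty cycle; 24x7 is for comparison.
poh_years() {
    awk -v h="$1" 'BEGIN {
        printf "at 40 h/week: %.1f years\n", h / (40 * 52)
        printf "at 24x7:      %.1f years\n", h / (24 * 365)
    }'
}

poh_years 22286
```

For 22286 hours this prints 10.7 years at 40 h/week and 2.5 years at 24x7. The second number is consistent with an array built in 2021 and left running continuously.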

> I don't know how to interpret the "Pre-fail" notation for the other
> attributes.

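AIUI "Pre-fail" is the attribute's type, not its current state: it marks an attribute whose normalized value, if it drops to or below its threshold, indicates imminent drive failure. "Old_age" attributes track normal wear instead. A "Pre-fail" label by itself does not mean the drive is failing; check the WHEN_FAILED column to see whether any attribute has actually crossed its threshold.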
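
One way to check what the attribute table is actually reporting: the TYPE column (Pre-fail / Old_age) classifies the attribute, while the WHEN_FAILED column shows whether it has crossed its threshold ("-" means it has not). A sketch that filters `smartctl -A` output on that column, assuming the usual ten-column ATA attribute layout; the function name is hypothetical.

```shell
# Sketch: list SMART attributes that have crossed their threshold.
# Reads `smartctl -A` output on stdin. Assumed column layout:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_failed_attrs() {
    # Keep only attribute rows (numeric ID) whose WHEN_FAILED is not "-".
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" {
        printf "%s %s type=%s value=%s thresh=%s when_failed=%s\n",
               $1, $2, $7, $4, $6, $9
    }'
}
```

Usage: `smartctl -A DEV | smart_failed_attrs` (as root). No output means no attribute is at or below its threshold now or in the past, which is separate from the question of whether the raw error counts are trending upward.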

> My default plan is to identify an appropriate model and buy a pair of
> replacement drives, but not install them yet; buy another two drives
> every six months, until I have a full replacement set; and start failing
> drives out of the RAID array and installing replacements as soon as one
> either fails, or looks like it's imminently about to fail.

If you want 24x7 storage at minimum total cost of ownership, I suggest 3.5" enterprise HDD's. I buy "new" or "open box" older model drives on eBay, the older the cheaper. They typically die within a month or run for years. SAS has more features, yet can be cheaper (assuming you have compatible hardware).

I prefer RAID-10 over RAID-5 or RAID-6 because IOPS scales up linearly with the number of mirrors (spindles). So, if one mirror does 120 IOPS (7200 RPM), two mirrors do 240 IOPS, three do 360 IOPS, etc. Also, resilvering is a direct disk-to-disk copy at sequential read and write speeds. To get protection against two-device failure, you need 3-way mirrors; or, a hot spare and a time delay longer than resilvering time between failures.
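
The linear model above, written out as arithmetic. The 120 IOPS per mirror is the rule-of-thumb figure from the paragraph, not a measurement, and the function name is made up.

```shell
# Linear IOPS model: total = number of mirrors * IOPS per mirror.
# The default of 120 is the 7200 RPM rule-of-thumb figure from the text.
raid10_iops() {
    mirrors=$1
    per_mirror=${2:-120}
    echo $(( mirrors * per_mirror ))
}

for n in 1 2 3 4; do
    echo "$n mirror(s): $(raid10_iops "$n") IOPS"
done
```

Eight drives arranged as four 2-way mirrors would come out at 480 IOPS under this model; real numbers depend on the drives and the workload.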

Finally, depending upon your choice of RAID, volume management, filesystem, etc., you might be able to re-use those SSD's as accelerators -- read cache, write cache, metadata, etc. (This is easy on ZFS. Perhaps other readers with mdadm, LVM, btrfs, etc., can comment on SSD acceleration for those.)



David
