Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?
On 1/9/24 05:11, The Wanderer wrote:
I have an eight-drive RAID-6 array of 2TB SSDs, built
back in early-to-mid 2021.
Within the past few weeks, I got root-mail notifications from smartd
that the ATA error count on two of the drives had increased ...
On Sunday (two days ago), I got root-mail notifications from smartd
about *all* of the drives in the array.
One thing I don't know, which may or may not be important, is whether
these alert mails are being triggered when the error-count increase
happens, or when a scheduled check of some type is run.
Please do a full backup to a portable HDD ASAP. Put that HDD off-site.
Get another HDD and do a full backup. Then do incremental backups
daily. After a week, two weeks, or a month, swap the drives. At some
point, start destroying older backups to make room for new backups.
Please burn your most critical data to high-quality optical media.
Enable checksums by some means (extended attributes, checksum file in
volume root, etc.). Validate the burn using the checksums. Then burn
and validate new critical data every week, two weeks, month, etc.
Validate checksums periodically.
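
A minimal sketch of the "checksum file in volume root" approach, assuming
GNU coreutils (sha256sum) and find. The manifest name SHA256SUMS and the
function names are my own choices for illustration, nothing standard:

```shell
# Write a SHA-256 manifest named SHA256SUMS in the root of the tree $1.
# Run this on the staging directory before burning, since the disc
# itself is read-only.
make_manifest() {
    (
        cd "$1" || exit 1
        # Hash every regular file except the manifest itself.
        find . -type f ! -name SHA256SUMS -exec sha256sum {} + \
            | sort -k 2 > SHA256SUMS
    )
}

# Re-read every file under $1 and compare against the manifest.
# Prints only mismatches; exit status is non-zero if any file differs.
verify_manifest() {
    (
        cd "$1" || exit 1
        sha256sum --quiet -c SHA256SUMS
    )
}
```

Run make_manifest on the directory you are about to burn, then
verify_manifest on the mounted disc, and again whenever you validate
old media.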
AIUI smartd runs periodically via systemd. Perhaps another reader can
post the incantation required to display the settings and/or locate past
SMART reports on disk.
You can always run smartctl manually to get a SMART report whenever you
want (I like the --xall/-x option):
# smartctl -x DEV
I've looked at the SMART attributes for the drives, and am having a hard
time determining whether or not there's anything worth being actually
concerned about here. Some of the information I'm seeing seems to
suggest yes, but other information seems to suggest no.
Reading SMART reports has a learning curve. STFW for the terms you do
not understand. And, beware that different manufacturers with different
engineers make different long-term predictions based upon different
short-term test data.
Looking at SMART reports over time for the same drive, looking for
trends, and noticing problems is exactly the right thing to do. You and
smartd did good. :-)
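
For tracking a trend by hand, here is a rough sketch that pulls one
attribute's raw value out of "smartctl -A" output and appends it to a
log. It assumes the usual ten-column attribute table, with the name in
column 2 and the raw value starting in column 10. The function names
and the log format are hypothetical:

```shell
# Read "smartctl -A" output on stdin and print the raw value (column 10)
# of the attribute named $1. For attributes whose raw value contains
# spaces, only the first word is printed.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Append "UTC-timestamp device attribute raw-value" to a log file.
# $1 = device, $2 = attribute name, $3 = log file. Needs root.
log_smart_raw() {
    printf '%s %s %s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" \
        "$(smartctl -A "$1" | smart_raw "$2")" >> "$3"
}
```

For example, "log_smart_raw DEV Uncorrectable_Error_Cnt LOGFILE" run
daily from cron would give you one line per day to compare.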
Most of the attributes are listed as of type "Old_age".
Samsung 870 EVO are good drives, but they are "consumer" drives -- e.g.
intended for laptop/desktop computers that are powered off or
hibernating most of the time. The SMART report you attached showed a
"Power_On_Hours" attribute value of 22286. Assuming an operational
specification of 40 hours/week, that SSD has usage equivalent to 10.7
years. So, it is old.
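
The arithmetic, as a small sketch (awk does the floating-point math; the
function name is mine). The 40 hours/week duty cycle is my assumption
from above. For comparison, the second call treats the same hours as
24x7 operation at 168 hours/week:

```shell
# Years of use implied by a Power_On_Hours figure at a given duty cycle.
# $1 = power-on hours, $2 = hours of use per week. Uses 52 weeks/year.
poh_to_years() {
    awk -v h="$1" -v w="$2" 'BEGIN { printf "%.1f\n", h / w / 52 }'
}

poh_to_years 22286 40     # desktop duty cycle: prints 10.7
poh_to_years 22286 168    # powered on 24x7:    prints 2.6
```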
I don't know how to interpret the "Pre-fail" notation for the other
attributes.
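AIUI "Pre-fail" is the attribute's type, not its current state. If the
normalized value of a "Pre-fail" attribute drops to or below its
threshold, the drive is expected to fail soon and should be replaced.
An "Old_age" attribute at or below its threshold indicates normal
wear-out instead.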
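
Per the smartctl(8) man page, the TYPE column only says what an
attribute would signal once its normalized VALUE falls to or below
THRESH. A rough filter for that condition, assuming the usual
"smartctl -A" column layout (VALUE in column 4, THRESH in column 6,
TYPE in column 7). The function name is hypothetical:

```shell
# Read "smartctl -A" output on stdin and print each Pre-fail attribute
# whose normalized VALUE is at or below its THRESH. Prints nothing if
# no Pre-fail attribute has reached its threshold.
prefail_at_threshold() {
    awk '$7 == "Pre-fail" && ($4 + 0) <= ($6 + 0) {
        print $2, "value=" $4, "thresh=" $6
    }'
}
```

Usage would be "smartctl -A DEV | prefail_at_threshold". smartctl also
flags such attributes itself in the WHEN_FAILED column.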
My default plan is to identify an appropriate model and buy a pair of
replacement drives, but not install them yet; buy another two drives
every six months, until I have a full replacement set; and start failing
drives out of the RAID array and installing replacements as soon as one
either fails, or looks like it's imminently about to fail.
If you want 24x7 storage at minimum total cost of ownership, I suggest
3.5" enterprise HDDs. I buy "new" or "open box" older model drives on
eBay, the older the cheaper. They typically die within a month or run
for years. SAS has more features, yet can be cheaper (assuming you have
compatible hardware).
I prefer RAID-10 over RAID-5 or RAID-6 because IOPS scales up linearly
with the number of mirrors (spindles). So, if one mirror does 120 IOPS
(7200 RPM), two mirrors do 240 IOPS, three do 360 IOPS, etc. Also,
resilvering is a direct disk-to-disk copy at sequential read and write
speeds. To get protection against two-device failure, you need 3-way
mirrors; or, a hot spare and a time delay longer than resilvering time
between failures.
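
The scaling estimate above as a trivial sketch. The 120 IOPS per mirror
is my ballpark figure for a 7200 RPM drive, not a measurement, and the
function name is made up:

```shell
# Estimated aggregate IOPS for $1 mirrors at $2 IOPS each
# (default 120), assuming perfectly linear scaling.
mirror_iops() {
    echo $(( $1 * ${2:-120} ))
}

for n in 1 2 3; do
    echo "$n mirror(s): $(mirror_iops "$n") IOPS"
done
```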
Finally, depending upon your choice of RAID, volume management,
filesystem, etc., you might be able to re-use those SSDs as
accelerators -- read cache, write cache, metadata, etc. (This is easy
on ZFS. Perhaps other readers with mdadm, LVM, btrfs, etc., can comment
on SSD acceleration for those.)
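
For ZFS, a dry-run sketch that only prints the "zpool add" command for
each role, so you can review it before running anything as root. The
role names and the function are hypothetical. The cache, log, and
special vdev types are the ZFS ones for L2ARC read cache, separate
intent log, and metadata. I mirror the log and special vdevs here,
which needs two devices each; losing a special vdev loses the pool:

```shell
# Print (do not run) the zpool command that adds SSDs to a pool in one
# of three roles. $1 = role, $2 = pool name, remaining args = devices.
zfs_accel_cmd() {
    role=$1
    pool=$2
    shift 2
    case $role in
        read-cache) echo "zpool add $pool cache $*" ;;
        write-log)  echo "zpool add $pool log mirror $*" ;;
        metadata)   echo "zpool add $pool special mirror $*" ;;
        *)          echo "unknown role: $role" >&2; return 1 ;;
    esac
}
```

For example, "zfs_accel_cmd read-cache POOL DEV" prints
"zpool add POOL cache DEV".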
David