
Re: SMART Uncorrectable_Error_Cnt rising - should I be worried?



On 1/9/24 05:11, The Wanderer wrote:

I have an eight-drive RAID-6 array of 2TB SSDs, built
back in early-to-mid 2021.


Within the past few weeks, I got root-mail notifications from smartd
that the ATA error count on two of the drives had increased ...

On Sunday (two days ago), I got root-mail notifications from smartd
about *all* of the drives in the array.

One thing I don't know, which may or may not be important, is whether
these alert mails are being triggered when the error-count increase
happens, or when a scheduled check of some type is run.


Please do a full backup to a portable HDD ASAP. Put that HDD off-site. Get another HDD and do a full backup. Then do incremental backups daily. After a week, two weeks, or a month, swap the drives. At some point, start destroying older backups to make room for new backups.
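The full-plus-incremental rotation above can be sketched with GNU tar's snapshot files. This is a minimal sketch under assumptions: GNU tar is available, and the directory names (demo-data, demo-backups) are made up -- point them at the array's mount point and the portable HDD.

```shell
# Hypothetical paths -- substitute your array mount point and backup HDD.
SRC=./demo-data
DEST=./demo-backups
mkdir -p "$SRC" "$DEST"
echo "important" > "$SRC/file1"

# Level-0 (full) backup; the .snar snapshot file records what was saved.
tar --listed-incremental="$DEST/backup.snar" -czf "$DEST/full.tar.gz" "$SRC"

# Later: reusing the same snapshot file captures only changes since the
# previous run, i.e. a daily incremental.
echo "new data" > "$SRC/file2"
tar --listed-incremental="$DEST/backup.snar" -czf "$DEST/incr-$(date +%F).tar.gz" "$SRC"

ls -l "$DEST"
```

Destroying older backups then amounts to deleting old incr-*.tar.gz archives once a newer full backup exists.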


Please burn your most critical data to high-quality optical media. Enable checksums by some means (extended attributes, a checksum file in the volume root, etc.). Validate the burn using the checksums. Then burn and validate new critical data every week, two weeks, month, etc. Validate checksums periodically.
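The checksum-file-in-volume-root approach can be sketched with sha256sum. The staging directory and file names here are made up; stage the files you intend to burn, generate the checksum file last, then re-run the validation against the mounted disc after burning.

```shell
# Hypothetical staging directory for the files to be burned.
mkdir -p ./burn-staging
echo "critical data" > ./burn-staging/archive.txt

# Create SHA256SUMS in the volume root, covering everything staged.
( cd ./burn-staging && sha256sum archive.txt > SHA256SUMS )

# After burning (and any time later), validate against the checksums.
( cd ./burn-staging && sha256sum -c SHA256SUMS )
```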


AIUI smartd runs periodically via systemd. Perhaps another reader can post the incantation required to display the settings and/or locate past SMART reports on disk.
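A hedged sketch of that incantation for Debian-family systems: smartd runs as a long-lived daemon that polls the drives (every 30 minutes by default) rather than as a timer unit, so the places to look are the service and its configuration files. Unit and file names may differ on other distributions.

```shell
# Service state (the unit is smartmontools.service on Debian,
# smartd.service on some other distributions).
systemctl status smartmontools.service

# Per-device directives: -m for mail destination, -s for the
# self-test schedule, etc.
cat /etc/smartd.conf

# Daemon options, including the polling interval (-i SECONDS).
cat /etc/default/smartmontools
```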


You can always run smartctl manually to get a SMART report whenever you want (I like the --xall/-x option):

# smartctl -x DEV


I've looked at the SMART attributes for the drives, and am having a hard
time determining whether or not there's anything worth being actually
concerned about here. Some of the information I'm seeing seems to
suggest yes, but other information seems to suggest no.


Reading SMART reports has a learning curve. STFW for the terms you do not understand. And, beware that different manufacturers with different engineers make different long-term predictions based upon different short-term test data.


Looking at SMART reports over time for the same drive, looking for trends, and noticing problems is exactly the right thing to do. You and smartd did good. :-)
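One way to make that trend-watching systematic is to log one raw attribute value per drive per day. A minimal sketch: the sample line below is made up to mimic `smartctl -A` output; in real use, replace the echo with `smartctl -A /dev/sdX` for each drive.

```shell
# Fabricated sample row in smartctl -A format:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
sample='187 Uncorrectable_Error_Cnt 0x0032 099 099 000 Old_age Always - 14'

# Print the raw value (last field) of the attribute of interest.
echo "$sample" | awk '$2 == "Uncorrectable_Error_Cnt" { print $NF }'
```

Appending `$(date +%F)` plus that value to a per-drive log file from cron gives you a time series you can diff or plot when the next alert arrives.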


Most of the attributes are listed as of type "Old_age".


Samsung 870 EVOs are good drives, but they are "consumer" drives -- i.e. intended for laptop/desktop computers that are powered off or hibernating most of the time. The SMART report you attached showed a "Power_On_Hours" attribute value of 22286. Assuming an operational specification of 40 hours/week, that SSD has usage equivalent to 10.7 years. So, it is old.
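For anyone checking the arithmetic behind that figure: 22286 power-on hours at an assumed 40-hour work week, 52 weeks per year.

```shell
# 22286 h / (40 h/week) / (52 weeks/year) = ~10.7 "work years"
awk 'BEGIN { printf "%.1f years\n", 22286 / 40 / 52 }'
```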


I don't know how to interpret the "Pre-fail" notation for the other
attributes.


AIUI "Pre-fail" marks attributes whose failure is predictive: if a "Pre-fail" attribute's normalized value drops to its threshold, the drive is expected to fail soon and should be replaced. The label by itself, with the value still above threshold, is normal and not cause for alarm.


My default plan is to identify an appropriate model and buy a pair of
replacement drives, but not install them yet; buy another two drives
every six months, until I have a full replacement set; and start failing
drives out of the RAID array and installing replacements as soon as one
either fails, or looks like it's imminently about to fail.


If you want 24x7 storage at minimum total cost of ownership, I suggest 3.5" enterprise HDDs. I buy "new" or "open box" older-model drives on eBay; the older, the cheaper. They typically either die within a month or run for years. SAS has more features, yet can be cheaper (assuming you have compatible hardware).


I prefer RAID-10 over RAID-5 or RAID-6 because IOPS scales up linearly with the number of mirrors (spindles). So, if one mirror does 120 IOPS (7200 RPM), two mirrors do 240 IOPS, three do 360 IOPS, etc. Also, resilvering is a direct disk-to-disk copy at sequential read and write speeds. To get protection against two-device failure, you need 3-way mirrors; or a hot spare and a time delay longer than the resilvering time between failures.
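The linear scaling claim, spelled out as arithmetic (~120 IOPS per 7200 RPM spindle, multiplied by the number of mirrors):

```shell
# Read IOPS estimate per mirror count, assuming ~120 IOPS per spindle.
for mirrors in 1 2 3 4; do
  echo "$mirrors mirror(s): $((mirrors * 120)) read IOPS"
done
```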


Finally, depending upon your choice of RAID, volume management, filesystem, etc., you might be able to re-use those SSDs as accelerators -- read cache, write cache, metadata, etc. (This is easy on ZFS. Perhaps other readers with mdadm, LVM, btrfs, etc., can comment on SSD acceleration for those.)


David

