This is not directly Debian-related, except insofar as the system involved is running Debian, but we've already had a somewhat similar thread recently and this forum is as likely as any I'm aware of to have people who might have the experience to address the question(s). I would be open to recommendations for alternate / better forums for this inquiry, if people have such.

For background: I have an eight-drive RAID-6 array of 2TB SSDs, built back in early-to-mid 2021. Until recently, as far as I'm aware there have not been any problems related to it.

Within the past few weeks, I got root-mail notifications from smartd that the ATA error count on two of the drives had increased - one from 0 to a fairly low value (I think between 10 and 20), the other from 0 to 1. I figured this was nothing to worry about - because of the relatively low values, because the other drives had not shown any such thing, and because of the expected stability and lifetime of good-quality SSDs.

On Sunday (two days ago), I got root-mail notifications from smartd about *all* of the drives in the array. This time, the total error counts had gone up to values in the multiple hundreds per drive. Since then (yesterday), I've also gotten further notification mails about at least one of the drives increasing further. So far today I have not gotten any such notifications.

One thing I don't know, which may or may not be important, is whether these alert mails are being triggered when the error-count increase happens, or when a scheduled check of some type is run. If it's the latter, then it might be that there's a monthly check and that's the reason why all eight drives got mails sent at once, but if it's the former, then the so-close-in-time alerts from all eight drives would seem more likely to reflect a real problem.

I've looked at the SMART attributes for the drives, and am having a hard time determining whether or not there's anything worth being actually concerned about here.
Some of the information I'm seeing seems to suggest yes, but other information seems to suggest no.

Relevant-seeming excerpts from the output of 'smartctl -a' on one of the drives are attached (rather than inline, to avoid line-wrapping). I can provide full output of that command for that drive, or even for all of the drives, if desired.

Things that seem to suggest that there may be reason to be concerned include, but may not be limited to:

The Uncorrectable_Error_Cnt, which is the value referenced by the alert mails, has risen well above its apparent previous value of 0, and signs are that it may keep rising.

The Runtime_Bad_Block count is nonzero.

The ECC_Error_Rate is nonzero (and, at least in the case of this specific drive, also equal to the Uncorrectable_Error_Cnt).

Most of the attributes are listed as of type "Old_age". That strikes me as unexpected; two and a half years of mostly-read-based operation does not seem like enough to qualify an SSD as "old", although my expectations here may well be off. (I would be inclined to expect five-to-ten years of operation out of a non-defective drive, assuming reasonable physical treatment otherwise, if not considerably more.)

As mentioned above, the increase in Uncorrectable_Error_Cnt has happened at nearly the same time (relative to drive installation date) for all the drives, and for some of the drives it seems to be continuing to increase.

I don't know how to interpret the "Pre-fail" notation for the other attributes. That terminology could be intended to mean "This drive has entered the final stage before failure, and its failure is expected to be imminent" - or it could equally well be the status that the attributes *start* in, with the intended meaning "This drive has not yet reached a stage where there is any reason to think it might fail".
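One way to test the second reading of "Pre-fail" / "Old_age" is to check whether the TYPE column is simply a decoding of the FLAG column. The sketch below assumes (this is an assumption about how smartctl derives the column, not something confirmed in this thread) that bit 0 of FLAG marks a "pre-failure" attribute and that everything else is shown as "Old_age". The attribute names, flags and types are copied from the attached excerpt; `PREFAIL_BIT` and `type_from_flag` are names made up for illustration.

```python
# (name, FLAG, TYPE as displayed) copied from the attached smartctl excerpt.
ATTRIBUTES = [
    ("Reallocated_Sector_Ct",   0x0033, "Pre-fail"),
    ("Power_On_Hours",          0x0032, "Old_age"),
    ("Power_Cycle_Count",       0x0032, "Old_age"),
    ("Wear_Leveling_Count",     0x0013, "Pre-fail"),
    ("Used_Rsvd_Blk_Cnt_Tot",   0x0013, "Pre-fail"),
    ("Program_Fail_Cnt_Total",  0x0032, "Old_age"),
    ("Erase_Fail_Count_Total",  0x0032, "Old_age"),
    ("Runtime_Bad_Block",       0x0013, "Pre-fail"),
    ("Uncorrectable_Error_Cnt", 0x0032, "Old_age"),
    ("Airflow_Temperature_Cel", 0x0032, "Old_age"),
    ("ECC_Error_Rate",          0x001a, "Old_age"),
    ("CRC_Error_Count",         0x003e, "Old_age"),
    ("POR_Recovery_Count",      0x0012, "Old_age"),
    ("Total_LBAs_Written",      0x0032, "Old_age"),
]

# Assumption: bit 0 of FLAG is the "pre-failure" marker.
PREFAIL_BIT = 0x0001


def type_from_flag(flag: int) -> str:
    """Hypothetical helper: derive the TYPE label from the FLAG bits alone."""
    return "Pre-fail" if flag & PREFAIL_BIT else "Old_age"


# Collect every attribute whose displayed TYPE disagrees with the flag bit.
mismatches = [
    name for name, flag, shown in ATTRIBUTES if type_from_flag(flag) != shown
]
print("attributes whose TYPE does not match the flag bit:", mismatches)
```

If the list comes back empty, that is at least consistent with TYPE being a fixed label for the kind of attribute rather than a statement about the drive's current condition. It does not by itself prove that reading.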
Things that seem to suggest that there may *not* be a reason to be concerned include, but may not be limited to:

The "VALUE" column for each of the attributes remains high; most are in the range from 098 to 100, and excluding the Airflow_Temperature_Cel figure, the lowest is 095, for Power_On_Hours. From what I've managed to find in reading online, this column is typically a percentage value, with lower percentages indicating that the drive is closer to failure.

The Total_LBAs_Written value, when combined with the Sector Size, results (if my math is correct) in a total-data-written figure of between 3TB and 4TB. That should be *well* under the advertised write endurance of this drive, given that the drive is 2TB and (both IIRC and from what I've found in reading up on such things again after these errors started to occur) those advertised values for similar-capacity drives seem to start in the hundreds of TB and go up.

So... as the Subject asks, should I be worried? How do I interpret these results, and at what point do they start to reflect something to take action over? If there is no reason to be worried, what *do* these alerts indicate, and at what point *should* I start to be worried about them?

I already *am* worried, to the point of having heartburn and difficulty sleeping over the possibility of data loss (there's enough on here that external backup would be somewhat difficult to arrange), but I'm not sure whether or not that is warranted.

My default plan is to identify an appropriate model and buy a pair of replacement drives, but not install them yet; buy another two drives every six months, until I have a full replacement set; and start failing drives out of the RAID array and installing replacements as soon as one either fails, or looks like it's imminently about to fail.
But if the mass notification mails indicate that all eight are nearing failure, that might not be enough - and if they don't indicate any likelihood of failure this year, then buying replacement drives yet might be premature.

What drives I choose to buy as replacement would also be influenced by how likely it is that this indicates impending failure. If it doesn't, then drives similar to what I already have would probably still be appropriate; if it does, then I'm going to want to go up-market and buy long-endurance drives intended for high uptime - i.e., data-center storage drives, which are likely to be more expensive.

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.         -- George Bernard Shaw
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 2TB
Serial Number:    S620NJ0R410888A
LU WWN Device Id: 5 002538 f31440901
Firmware Version: SVT01B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan  9 07:32:13 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       31
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       22286
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   098   098   010    Pre-fail  Always       -       31
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   098   098   010    Pre-fail  Always       -       31
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       598
190 Airflow_Temperature_Cel 0x0032   069   050   000    Old_age   Always       -       31
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       598
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       21
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6950839497
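For anyone who wants to check the arithmetic behind the "between 3TB and 4TB" and "two and a half years" figures, here is a minimal sketch using the numbers from the excerpt above. It assumes that Total_LBAs_Written counts logical sectors of the reported 512-byte Sector Size; that is how the figure in the message was derived, but it has not been confirmed against the drive vendor's documentation.

```python
# Raw values copied from the smartctl excerpt above.
SECTOR_SIZE_BYTES = 512          # "Sector Size: 512 bytes logical/physical"
TOTAL_LBAS_WRITTEN = 6950839497  # attribute 241, RAW_VALUE
POWER_ON_HOURS = 22286           # attribute 9, RAW_VALUE

# Assumption: each LBA counted by attribute 241 is one logical sector.
bytes_written = TOTAL_LBAS_WRITTEN * SECTOR_SIZE_BYTES

# Report in both decimal terabytes (as drives are marketed) and binary TiB.
tb_written = bytes_written / 10**12
tib_written = bytes_written / 2**40

# Power-on time in years, using an average year of 365.25 days.
years_powered_on = POWER_ON_HOURS / (24 * 365.25)

print(f"bytes written:    {bytes_written}")
print(f"data written:     {tb_written:.2f} TB ({tib_written:.2f} TiB)")
print(f"powered-on time:  {years_powered_on:.2f} years")
```

Under that assumption the total comes to roughly 3.56 TB written over a little more than 2.5 years of power-on time, which matches the figures quoted in the message.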