Re: I/O errors during RAID check but no SMART errors
Jochen Spieker wrote:
> I have two disks in a RAID-1:
>
> | $ cat /proc/mdstat
> | Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> | md0 : active raid1 sdb1[2] sdc1[0]
> | 5860390400 blocks super 1.2 [2/2] [UU]
> | bitmap: 5/44 pages [20KB], 65536KB chunk
> |
> | unused devices: <none>
>
> During the latest monthly check I got kernel messages like this:
>
> | Oct 06 00:57:01 jigsaw kernel: md: data-check of RAID array md0
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: irq_stat 0x40000008
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: failed command: READ FPDMA QUEUED
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: cmd 60/80:d0:80:74:f9/08:00:2d:02:00/40 tag 26 ncq dma 1114112 in
> | res 41/40:00:50:77:f9/00:00:2d:02:00/00 Emask 0x409 (media error) <F>
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: status: { DRDY ERR }
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: error: { UNC }
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: configured for UDMA/133
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current]
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 CDB: Read(16) 88 00 00 00 00 02 2d f9 74 80 00 00 08 80 00 00
> | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
> | Oct 06 14:27:11 jigsaw kernel: ata3: EH complete
If this happens once, it's just a thing that happened.
If it happens multiple times, it means that there's a hardware
error: sometimes a cable, rarely the SATA port, often the drive.
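The drive keeps its own error log, separate from the attribute
table, and it records the failing LBA and the error type:

  # smartctl -l error /dev/sdb

UNC entries there are media; ICRC entries would point back at the
cable or port.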
> The sector number mentioned at the bottom is increasing during the
> check.
So it repeats, and it's contiguous. That suggests a flaw in the
drive itself.
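If you want to check, pull the failing sectors out of the journal
and see how tightly they cluster (a rough sketch; adjust the pattern
to your exact messages):

  $ journalctl -k | grep -oE 'dev sdb, sector [0-9]+' | awk '{print $NF}' | sort -nu

A handful of nearby sector numbers means one bad patch; numbers
scattered across the whole disk would point somewhere else.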
> The way I understand these messages is that some sectors cannot be read
> from sdb at all and the disk is unable to reallocate the data somewhere
> else (probably because it doesn't know what the data should be in the
> first place).
Yes.
> The disk has been running continuously for seven years now and I am
> running out of space anyway, so I already ordered a replacement. But I
> do not fully understand what is happening.
The drive is dying, slowly. In this case it's starting with a
bad patch on a platter.
> Two of these message blocks end with this:
>
> | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
>
> What does that mean for the other instances of this error? The data
> is still readable from the other disk in the RAID, right? Why doesn't md
> mention it? Why is the RAID still considered healthy? At some point I
> would expect the disk to be kicked from the RAID.
The "rescheduling sector" line means md retried that read from the
other mirror; normally it then rewrites the bad spot on sdb1 from
the good copy, which is why nothing louder shows up and the array
still reports [UU]. md only kicks a member when it can't repair a
read from the other copy or a write to the member fails. It will
eventually do that, but not until things get bad enough, and by then
the degradation could be quite noticeable.
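If you would rather not wait for md, you can fail the member by hand
once you're ready (partition name taken from your mdstat):

  # mdadm --detail /dev/md0
  # mdadm /dev/md0 --fail /dev/sdb1
  # mdadm /dev/md0 --remove /dev/sdb1

--detail also shows the per-member state, which tells you more than
the [UU] summary in /proc/mdstat.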
> I unmounted the filesystem and performed a bad blocks scan (fsck.ext4
> -fcky) that did not find anything of importance (only "Inode x extent
> tree (at level 1) could be shorter/narrower"), and it also did not yield
> any of the above kernel messages. But another RAID check triggers these
> messages again, just with different sector numbers. The RAID is still
> healthy, though.
I don't think it is.
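md does keep score, just not in mdstat. Assuming the usual sysfs
layout for your md0/sdb1, these are worth a look after a check run:

  $ cat /sys/block/md0/md/mismatch_cnt
  $ cat /sys/block/md0/md/dev-sdb1/errors

The second one is md's running count of read errors it has seen on
that member; a non-zero value there with a zero on dev-sdc1 tells
the same story as the kernel log.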
> Should this tell me that new sectors are dying all the time, or
> should this lead me to believe that a cable / the SATA controller is at
> fault? I don't even see any errors with smartctl:
If the sectors were effectively random, a cable fault would be
likely. If the sectors are contiguous or nearly-so, that's
definitely the disk.
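A bad cable normally shows up as CRC errors on the link rather than
UNC media errors; attribute 199 (UDMA_CRC_Error_Count) is the one
that climbs in that case, and yours is still at zero below:

  # smartctl -A /dev/sdb | grep UDMA_CRC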
> | SMART Attributes Data Structure revision number: 16
> | Vendor Specific SMART Attributes with Thresholds:
> | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> | 1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
> | 3 Spin_Up_Time 0x0027 198 197 021 Pre-fail Always - 9100
> | 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 83
> | 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> | 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
> | 9 Power_On_Hours 0x0032 016 016 000 Old_age Always - 61794
> | 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> | 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> | 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 82
> | 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
> | 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2219
> | 194 Temperature_Celsius 0x0022 119 116 000 Old_age Always - 33
> | 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> | 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
> | 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> | 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> | 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 43
This looks like a drive which is old and starting to wear out
but is not there yet. The raw read error rate is starting to
creep up but isn't at a threshold.
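The two numbers I would re-check over the next days are 5
(Reallocated_Sector_Ct) and 197 (Current_Pending_Sector), both still
zero in your output:

  # smartctl -A /dev/sdb | grep -E 'Reallocated_Sector|Current_Pending'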
> I am still waiting for the result of a long self-test.
>
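The long test reads the whole surface, so it should stumble over the
same region; when it finishes, the result and the LBA of the first
read failure show up in the self-test log:

  # smartctl -l selftest /dev/sdb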
> Do you think I should remove the drive from the RAID immediately? Or
> should I suspect something else is at fault? I prefer not to run the
> risk of losing the RAID completely if I keep running on one disk
> while the new one is being shipped. I do have backups, but it would be
> great if I didn't need to restore.
If the disk is a few days away from being replaced, I would not
bother pulling it now, but I would treat the array as if it were no
longer a full mirror: assume sdb can't be trusted, and that losing
the good disk in the meantime would mean restoring from backup.
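When the new disk arrives, the swap itself is quick. A sketch,
assuming the new drive shows up as /dev/sdd and both disks use GPT
(adjust names to what you actually get), after failing and removing
sdb1 as above:

  # sgdisk -R /dev/sdd /dev/sdc    # copy the partition layout from the good disk
  # sgdisk -G /dev/sdd             # give the copy new GUIDs
  # mdadm /dev/md0 --add /dev/sdd1
  $ cat /proc/mdstat               # watch the resync

Keep sdb around untouched until the resync finishes; it is still a
mostly-good second copy if sdc hiccups during the rebuild.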
-dsr-