
Re: using ddrescue on the root partition - boot with / as read-only



On 9/15/23 05:46, Vincent Lefevre wrote:
On 2023-09-14 22:24:59 -0700, David Christensen wrote:
On 9/14/23 03:17, Vincent Lefevre wrote:
I get UNC errors like

2023-09-10T11:50:59.858670+0200 zira kernel: ata1.00: exception Emask 0x0 SAct 0xc00 SErr 0x40000 action 0x0
2023-09-10T11:51:00.117366+0200 zira kernel: ata1.00: irq_stat 0x40000008
2023-09-10T11:51:00.117431+0200 zira kernel: ata1: SError: { CommWake }
2023-09-10T11:51:00.117474+0200 zira kernel: ata1.00: failed command: READ FPDMA QUEUED
2023-09-10T11:51:00.117511+0200 zira kernel: ata1.00: cmd 60/00:50:b8:12:c5/02:00:1f:00:00/40 tag 10 ncq dma 262144 in
                                                        res 41/40:00:90:13:c5/00:02:1f:00:00/00 Emask 0x409 (media error) <F>
2023-09-10T11:51:00.117537+0200 zira kernel: ata1.00: status: { DRDY ERR }
2023-09-10T11:51:00.117560+0200 zira kernel: ata1.00: error: { UNC }
2023-09-10T11:51:00.117583+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117614+0200 zira kernel: ata1.00: supports DRM functions and may not be fully accessible
2023-09-10T11:51:00.117651+0200 zira kernel: ata1.00: configured for UDMA/133
2023-09-10T11:51:00.117681+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
2023-09-10T11:51:00.117953+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
2023-09-10T11:51:00.118165+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
2023-09-10T11:51:00.118366+0200 zira kernel: sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 1f c5 12 b8 00 02 00 00
2023-09-10T11:51:00.118557+0200 zira kernel: I/O error, dev sda, sector 533009296 op 0x0:(READ) flags 0x80700 phys_seg 37 prio class 2
2023-09-10T11:51:00.118582+0200 zira kernel: ata1: EH complete
2023-09-10T11:51:00.118608+0200 zira kernel: ata1.00: Enabling discard_zeroes_data
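
The Read(10) CDB in the log above can be decoded by hand to see which sectors the failed command covered. What follows is a minimal sketch in POSIX shell, assuming the standard Read(10) layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control). The helper name decode_read10 is hypothetical.

```shell
# Decode the start LBA and transfer length from a SCSI Read(10) CDB,
# passed as ten space-separated hex bytes (as printed in the kernel log).
decode_read10() {
    # Byte 0 ($1) must be opcode 0x28, which identifies Read(10).
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a Read(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 ($3..$6) hold the big-endian logical block address.
    lba=$(( 0x$3$4$5$6 ))
    # Bytes 7-8 ($8, $9) hold the big-endian transfer length in sectors.
    len=$(( 0x$8$9 ))
    echo "start_lba=$lba sectors=$len"
}

# CDB copied from the log line above.
decode_read10 28 00 1f c5 12 b8 00 02 00 00
# prints: start_lba=533009080 sectors=512
```

The 512-sector range starting at 533009080 contains sector 533009296 from the "I/O error" line, and 512 sectors of 512 bytes match the 262144-byte transfer shown in the cmd line.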

What is the make and model of the laptop?

HP ZBook 15 G2 (2015)


That is a good laptop.



What is the make and model of the disk drive?

Samsung 870 EVO 1TB SATA (since January 2022)


That is a good SSD.



When and where do you see the above error messages?

It seems that this occurs when bad sectors are read, either when some
files (using these bad sectors) are read or when I use the badblocks
utility (until now, I've used it only with the read test, i.e. with
no options). The messages appear in the journalctl output.


Okay.



and after these errors, the kernel remounts the root partition as
read-only.

That sounds like a reasonable boot loader response to an OS drive error
during boot.

There are no errors during boot. They occur only when I read the
affected files or use badblocks, and the remount happens only after a
certain number of errors.


Oops -- I misread "remount" as "mount".


Due to these errors, some files are unreadable.

badblocks says that there are 25252 bad blocks.


That number is large enough to make me worry.



I'm using ddrescue before doing anything else (mainly in case things
get worse), but I would mainly be interested in knowing
which files are affected.
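
One way to find the affected files is to convert each bad disk sector to a filesystem block number and look that block up in the filesystem. The sketch below covers only the arithmetic. It assumes 512-byte logical sectors and a filesystem sitting directly on a partition (no LVM or LUKS layer, which the thread does not specify). The helper name sector_to_fs_block, the partition start of 2048, and the 4096-byte block size are all hypothetical; the real values would come from fdisk -l and from the filesystem's own tools.

```shell
# Convert an absolute 512-byte disk sector into a filesystem block number.
# Arguments: bad sector, partition start sector, filesystem block size (bytes).
sector_to_fs_block() {
    sector=$1
    part_start=$2
    block_size=$3
    if [ "$sector" -lt "$part_start" ]; then
        echo "sector lies before the partition start" >&2
        return 1
    fi
    # Offset within the partition in bytes, divided by the block size.
    echo $(( (sector - part_start) * 512 / block_size ))
}

# Sector taken from the "I/O error" log line; the partition start (2048)
# and block size (4096) are illustrative assumptions, not known values.
sector_to_fs_block 533009296 2048 4096
# prints: 66625906
```

If the filesystem is ext2/3/4, debugfs can then map the block to an inode and the inode to a path (the device name is a placeholder):

	# debugfs -R "icheck 66625906" /dev/sdXN
	# debugfs -R "ncheck <inode-from-icheck>" /dev/sdXN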

Was the computer working correctly in the past?

Yes, except a few days before the first disk errors on 6 December 2022:
I got crashes from time to time (which never happened before). About
2 hours before the first errors, I upgraded the kernel and the NVIDIA
drivers from 390.154 to 390.157. In the changelog of 390.157-1:

nvidia-graphics-drivers-legacy-390xx (390.157-1) unstable; urgency=medium

   * New upstream legacy branch release 390.157 (2022-11-22).
     * Fixed CVE-2022-34670, CVE-2022-34674, CVE-2022-34675, CVE-2022-34677,
       CVE-2022-34680, CVE-2022-42257, CVE-2022-42258, CVE-2022-42259.
       https://nvidia.custhelp.com/app/answers/detail/a_id/5415
       (Closes: #1025281)
     * Improved compatibility with recent Linux kernels.

   [ Andreas Beckmann ]
   * Refresh patches.
   * Rename the internally used ARCH variable which might clash on externally
     set values.
   * Use substitutions for ${nvidia-kernel} and friends (510.108.03-1).
   * Try to compile a kernel module at package build time (510.108.03-1).

  -- Andreas Beckmann <anbe@debian.org>  Sat, 03 Dec 2022 22:17:01 +0100

I'm wondering whether the crashes were due to the compatibility
with the kernel (which was the latest Debian/unstable one).


Taken together, the clues make me think the SSD is failing.



When did you first notice the error messages?  What was the computer doing
at the time?

I first got errors on 6 December 2022 when I was reading these files.
At that time, I identified 5 files, which I put in a
private/unreadable-files directory. Then everything was OK
until a few days ago, when I wanted to duplicate a big directory
(to try to reproduce a bug).

Did you make any changes to the computer (hardware, software, configuration,
apps, other) immediately prior to the start of the error messages?

See above (and no hardware change).

Does the computer now generate error messages?  Consistently?  What is it
doing when the error messages are generated?

I get errors only when I read some particular files.


I suggest:

1. Keep your backups safe. Run an incremental backup to get newer files that can be read. Forget about files that cannot be read.

2. If you do not have a backup of a file that cannot be read and you need that data, send the SSD to the manufacturer or a service for data recovery.

3.  Otherwise, get a SMART extended report for the SSD:

	# smartctl -x /dev/disk/by-id/ata-pick-the-correct-disk

4.  Get disk partitioning, etc., information for the SSD:

	# fdisk -l /dev/disk/by-id/ata-pick-the-correct-disk

	(Relevant LVM, LUKS, or other commands, as appropriate).

5. Use the SSD manufacturer diagnostic tool to gather information, run tests, update the firmware, secure erase the SSD, and test again.


If the manufacturer diagnostic cannot secure erase the SSD, physically destroy the SSD.


If the manufacturer diagnostic can secure erase the SSD, but the SSD cannot pass all tests, recycle the SSD.


If the manufacturer diagnostic can secure erase the SSD and the SSD can pass all tests, get an updated SMART report, pick a suitable use for the drive (ZFS cache device comes to mind), deploy the SSD, and monitor the SSD frequently going forward.
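
For the ongoing monitoring, the exit status of smartctl is a bitmask that a script can check without parsing the report text. A minimal sketch follows, with bit meanings paraphrased from the smartctl(8) manual page; the function name describe_smartctl_status is hypothetical.

```shell
# Print one line per condition encoded in a smartctl exit status.
# Argument: the numeric exit status of a smartctl run (0-255).
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    [ $(( status & 1 )) -ne 0 ]   && echo "command line did not parse"
    [ $(( status & 2 )) -ne 0 ]   && echo "device open failed"
    [ $(( status & 4 )) -ne 0 ]   && echo "a SMART or ATA command failed, or checksum error"
    [ $(( status & 8 )) -ne 0 ]   && echo "SMART status check returned DISK FAILING"
    [ $(( status & 16 )) -ne 0 ]  && echo "prefail attributes at or below threshold"
    [ $(( status & 32 )) -ne 0 ]  && echo "attributes were at or below threshold in the past"
    [ $(( status & 64 )) -ne 0 ]  && echo "device error log contains records of errors"
    [ $(( status & 128 )) -ne 0 ] && echo "device self-test log contains records of errors"
    # Always succeed: the function only describes, it does not judge.
    return 0
}
```

A periodic job could run the report, discard the text, and describe the status:

	# smartctl -x /dev/disk/by-id/ata-pick-the-correct-disk > /dev/null
	# describe_smartctl_status $?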


David

