
Bug#1076372: Observations, Logs, and Testing




Hi,

here are more test results.

Configuration 1: Kingston read error in 1st M.2 socket:

* I ran f3 tests in a loop for several days. The read errors with the
  6.1.112 kernel could only be reproduced once. (f3 reported 5120
  corrupted 512-byte sectors in one 1 GB file, which is not very
  informative.)

Configuration 2: Lexar write errors in 1st M.2 slot:

* They occur with the Debian 6.11.5 kernel and with an unmodified
  (self-compiled) 6.11.5 kernel, but not with 6.1 kernels.
* f3 reports overwritten sectors.
* I always tested with 500 files (= 500 GB); see the f3 sketch below
  the list. The last 150 or so files were never corrupted (overwritten).
  This could also explain why there are no file system errors, if that
  data is written last (journal?).
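
Such a run can be reproduced with f3 roughly like this (a sketch, not
my exact invocation; /mnt/lexar is a placeholder for the mount point of
the file system on the Lexar SSD):

  f3write --end-at=500 /mnt/lexar   # writes 1.h2w ... 500.h2w, 1 GB each
  f3read /mnt/lexar                 # verifies them and reports sectors as
                                    # ok / corrupted / changed / overwritten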

Furthermore I enclosed boot logs. The ones whose file names contain
"-deb" were produced with Debian kernels. The log that ends with
"-deb1" was produced while the Kingston SSD was in the 1st M.2 socket
and the Lexar SSD in the 2nd one (Configuration 1). During all other
logs the Lexar SSD was in the 1st socket (Configuration 2).

Regards Stefan


On 15.11.24 at 13:10, Stefan wrote:
Hi,

I did not receive the test script. Therefore I tried to test several
6.1 kernels with f3, without success: I could not reproduce the errors.

The corruption of the Kingston NVMe in the first M.2 socket with the
6.1.94 kernel (6.1.0-22-amd64) was first noticed in a lengthy
computation where the SSD is used as a cache and to store the final
results. I saved the erroneous results of these computations and
re-analyzed them:

* The first results were correct
* The corruption appeared after a certain runtime
* The errors became more frequent the longer the PC ran

After I noticed the errors, I first ran a memory test (success) and then
the f3 test (failed). Unfortunately I did not log the results.

Then I booted into the 6.10 kernel and re-ran f3 (success). Furthermore
I successfully completed all computations mentioned above with that
kernel. That's why I have some time now for more testing.

My interpretation was that the 6.1 kernel is responsible for the
errors. But since I cannot reproduce them, the errors may have
disappeared after rebooting for another reason.

Without the Lexar issues I would say that the Kingston NVMe is defective ...

ATM I'm trying to reproduce (and log) the 6.1 issues of the Kingston
NVMe using an f3 write:read loop (1:20), roughly as sketched below.
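
The loop looks roughly like this (a sketch; /mnt/kingston is a
placeholder for the mount point on the Kingston SSD):

  while true; do
      f3write --end-at=1 /mnt/kingston        # write the 1 GB test file once
      for i in $(seq 20); do                  # ... then read it back 20 times
          f3read /mnt/kingston | tee -a f3read.log
      done
  done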

While I'm running this (and probably more) test(s) next week, I propose
to rename the bug to something like "NVMe issues with ASRock DeskMini
X600 + Ryzen 8700G". If I do not find out more, I'll try to contact
ASRock and AMD, refer to this bug, and ask for a statement whether it is
a Linux incompatibility or a hardware error ...


Regarding your questions and other observations:

* The scheduler in all cases (NVMes, kernels) is "[none] mq-deadline"
* I collected boot logs. But none of these boots resulted in an error.
   I can send the logs via PM, but I think without a comparison this
   makes no sense
* I found the kernel log during which the errors occurred (the boot part
   was older and had already been deleted). There I found the messages
   attached in `pcierr.txt`. But these messages only occurred once,
   while the corruptions were quite frequent. Furthermore, it is another
   device (lspci output attached in `lspci.txt`)
* Error counts in `/sys/bus/pci/drivers/nvme/*/aer*` are all zero
   (while I'm running the f3 loop); see the note below
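
(Those counters can be dumped in one go with something like
`grep -H . /sys/bus/pci/drivers/nvme/*/aer*`.)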

Regards Stefan



On 02.11.24 at 13:58, Stefan wrote:


Hi,

currently the Lexar SSD is installed in the rear M.2 socket and the
Kingston in the primary one. (With the 6.10 kernel that configuration
works.)

ATM I would like to focus on the 6.1 read issues with the Kingston SSD
in the primary M.2 socket, because that can be done remotely and without
removing the Kingston SSD. (I need both SSDs ATM and I do not want to
install the Kingston one in the rear socket, because I would have to
break a warranty seal in order to remove the heat sink.)

The unpatched 6.1.112 kernel is compiling while I write this email. So
please send me your test script; then I will test the unpatched kernel
vs. the Debian variant, collect the infos and logs you asked for, and
run your script and f3.

Regards Stefan


On 02.11.24 at 00:59, Tj wrote:
Package: linux-image-6.11.5+debian+tj
Version: 6.11.5-265
Followup-For: Bug #1076372
X-Debbugs-Cc: tj.iam.tj@proton.me

Thanks for the response - very useful.

I suspect the cause may be a conjunction of several issues that in
themselves are relatively minor.

Observations:

1. That Lexar has the Maxio 1602 controller. There were quite a lot of
problems in-kernel with that controller and it took a while to resolve
those (device didn't initialise within timeout) with a quirk (commit
a3a9d63dcd15535e7i v6.4) that is model specific since the problem
doesn't affect all Maxio 1602 based devices (suggests a device firmware
incompatibility somewhere).

2. Lexar-specific firmware bugs that require kernel workarounds suggest
the device firmware isn't entirely robust:
1231363aec86704a6b04 NM760
b65d44fa0fe072c91bf4 NM620

3. It appears that with that mobo the M2_1 Gen5 x4 socket might be part
of the problem mix here, since M2_2 Gen4 x4 seems to be fine.

Logs:

Can you attach a kernel log from boot to when system services are
started - this is so we can see exactly how the PCIe and NVMe devices
are configuring. It doesn't matter which kernel version, and it doesn't
matter if the devices are currently in different sockets, provided it
shows the Lexar device being initialised. `journalctl --dmesg --boot ${boot_id}`
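(`journalctl --list-boots` lists the available boot IDs, if you need to
pick an older boot.)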

Also, for the Lexar, what scheduler is in use?
`cat /sys/block/nvme*n?/queue/scheduler`

Testing:
I wrote a shell script that does something similar to f3; if I modify it
to target your situation, would you be able to run it and report results?

My thinking is for it to pre-build 128KiB blocks in tmpfs with a
deterministic human-recognisable ASCII pattern in each, then use dd to
do various write+read tests to a single large file and detect when
corruption occurs, where, and whether the timing or patterns used
reveal how it has gone wrong.

I'd also want it to be able to tell us where on the device (logical
block address) the writes end up in case it is related.
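
Roughly, the core of it would look something like the sketch below
(a simplified illustration, not the actual script; the mount point and
file names are placeholders, and it reuses one pattern block instead of
varying the pattern per block):

  blk=/dev/shm/pattern.bin               # 128KiB pattern block in tmpfs
  tgt=/mnt/test/bigfile                  # single large test file on the NVMe
  yes 'NVME-TEST-0123456789-ABCDEFGHIJ' | head -c $((128*1024)) > "$blk"
  for i in $(seq 0 8191); do             # 8192 x 128KiB = 1GiB
      dd if="$blk" of="$tgt" bs=128K seek="$i" count=1 \
         conv=notrunc oflag=direct status=none
  done
  for i in $(seq 0 8191); do             # read each block back and compare
      dd if="$tgt" bs=128K skip="$i" count=1 iflag=direct status=none \
         | cmp -s - "$blk" || echo "corruption in 128KiB block $i"
  done
  filefrag -v "$tgt"                     # physical extents, to map file blocks
                                         # back to on-device addresses

Varying the pattern per block would additionally show, from the corrupt
data itself, where a stray write originated.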

As I said previously it is often possible to deduce what is going wrong
from the type of corruption. We need this to know where to focus
attention.


Attachment: logs.tar.gz
Description: application/gzip

