Bug#1091893: linux-image-6.1.0-28-amd64: Watchdog detected hard LOCKUP on CPU 8, then CPU 0
On Wed, 8 Jan 2025 22:15:48 +0100
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> Hello Neal,
>
> On Wed, Jan 01, 2025 at 11:18:37PM -0500, Neal Murphy wrote:
> > Package: src:linux
> > Version: 6.1.119-1
> > Severity: critical
> > Justification: breaks the whole system
> >
> > Dear Maintainer,
> >
> > I plugged in my SSK NVME-to-USB3 adapter. I mounted it, checked it (without
> > writing anything), and unmounted it. The system displayed the '... has data to
> > be written ...' msg for quite a while. Around then, the system displayed the
> > watchdog error on CPU 8. Shortly after, it displayed a watchdog error on CPU 0
> > and the system became unresponsive requiring a hard reset.
> >
> > When I got the SSK, it worked well on the desktop. Months later, I had problems
> > with it, but didn't get any kernel oopses. The drive works OK on my Asus
> > laptop, so I'm beginning to suspect my desktop's hardware.
> >
> > I'm reporting this because flaky hardware usually shouldn't cause a system
> > lockup.
>
> This is only half of the truth. In an ideal world it would be true,
> but in reality this often doesn't work.
>
> There is another bugreport that looks quite similar to yours:
> https://lore.kernel.org/all/bug-219532-208809@https.bugzilla.kernel.org%2F/.
> The currently last message in that thread (from Dec 1, 22:07) has a
> patch. It would be great if you could test that and report upstream.
>
> Best regards
> Uwe
Hmmm. It's definitely a hardware (mainboard) issue of some kind.
Running Linux 6.11.5 from backports.
------------------------------
The device works fine plugged into a USB3.2 port in the back of the computer. It will mount and umount rapidly many times. I can read many GiB of data from it. I can write 10 GiB of data to it. I can let it sit idle for some minutes. No errors appear in syslog.
Plugged into one of the front USB3 ports, it works fine for about a minute. Then the system produces variations of the following:
----
2025-01-09T03:20:26.514887-05:00 playground kernel: [596625.269156] sd 9:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
2025-01-09T03:20:26.514900-05:00 playground kernel: [596625.269479] sd 9:0:0:0: [sdd] tag#18 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
----
and more errors, finally unmounting and disconnecting the drive. The errors occur whether or not I do anything with the drive (read, mount, read-write files, unmount, etc.)
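For anyone trying to compare ports or kernels, here is a minimal sketch of how the abort messages could be pulled out of a log and counted. The regular expression is an assumption derived only from the two sample lines above, and the function name find_uas_aborts is made up for illustration; other kernel versions may format these messages differently.

```python
import re

# Assumed pattern, based only on the two sample syslog lines quoted above.
UAS_ABORT = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+sd (?P<hctl>[\d:]+): \[(?P<dev>\w+)\] "
    r"tag#(?P<tag>\d+) uas_eh_abort_handler"
)


def find_uas_aborts(lines):
    """Return (uptime_seconds, device, tag) for each uas_eh_abort_handler line."""
    hits = []
    for line in lines:
        m = UAS_ABORT.search(line)
        if m:
            hits.append((float(m.group("uptime")), m.group("dev"), int(m.group("tag"))))
    return hits


# The two lines from the report; only the first is an abort-handler message.
sample = [
    "2025-01-09T03:20:26.514887-05:00 playground kernel: [596625.269156] "
    "sd 9:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN",
    "2025-01-09T03:20:26.514900-05:00 playground kernel: [596625.269479] "
    "sd 9:0:0:0: [sdd] tag#18 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00",
]

print(find_uas_aborts(sample))
```

The same function could be fed the lines of a saved syslog file or of kernel-log output; the time between plugging in the drive and the first hit would give a rough figure for how long each port stays healthy.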
If I plug the drive into a front port and do nothing with it, the errors occur after about 30 seconds.
Importantly, the system does *not* hang/crash when running 6.11.5; the errors are handled well.
Linux 6.1.0
-----------
As for Bookworm's 6.1 kernel: I might have better luck patching and building the 6.1.0-28 kernel than I did with 6.11 from backports (trying to build that was a Borg-ish experience). Still, I would gladly run an xhci module patched and built by someone familiar with the Debian build methodology, which is alien territory for me, provided the patch noted above applies easily to 6.1. If it has lots of debugging built in, even better.
Thanks,
Neal