AMD GPU hard lockups

To: debian-user <debian-user@lists.debian.org>
Subject: AMD GPU hard lockups
From: Celejar <celejar@gmail.com>
Date: Tue, 1 Aug 2023 11:44:48 -0400
Message-id: <20230801114448.68bbaf1439091ea4452cc0fa@gmail.com>

Hello,

I have a system running Debian unstable with an AMD Radeon RX 570. It
had been working fine for a while, but recently anything that uses the
more advanced features of the GPU causes a hard lockup: black screen,
no response to the keyboard, no network connectivity.

I'm not sure exactly which functionality of the GPU causes this:
ordinary web browsing, development work, etc. never cause problems, but
games, the Unigine Heaven benchmark, and even glmark2 invariably do,
sometimes immediately, sometimes after a few seconds or minutes.

I'm not sure exactly when this began: I hadn't been using the system
for any of the problematic tasks for a while.

I've tried looking in the logs. Running 'journalctl -b -1' after
rebooting from a lockup generally shows nothing relevant. I've also
tried to catch the error live with 'tail -F /var/log/syslog'. Most of
the time I see nothing (just the hang, with no warning in the log), but
once I caught this:

2023-08-01T10:41:23.531381-04:00 lucy kernel: [38532.241396] gmc_v8_0_process_interrupt: 15 callbacks suppressed
2023-08-01T10:41:23.531394-04:00 lucy kernel: [38532.241401] amdgpu 0000:02:00.0: amdgpu: GPU fault detected: 147 0x06508401 for process heaven_x64 pid 14771 thread heaven_x64:cs0 pid 14792
2023-08-01T10:41:23.531395-04:00 lucy kernel: [38532.241407] amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x060004CA
2023-08-01T10:41:23.531396-04:00 lucy kernel: [38532.241408] amdgpu 0000:02:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02084001
2023-08-01T10:41:23.531396-04:00 lucy kernel: [38532.241409] amdgpu 0000:02:00.0: amdgpu: VM fault (0x01, vmid 1, pasid 32778) at page 100664522, read from 'TC7' (0x54433700) (132)
2023-08-01T10:41:23.531397-04:00 lucy kernel: [38532.241429] DMAR: DRHD: handling fault status reg 2
2023-08-01T10:41:23.531398-04:00 lucy kernel: [38532.241433] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbff5000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531399-04:00 lucy kernel: [38532.241438] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbfd1000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531400-04:00 lucy kernel: [38532.241442] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbffd000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531409-04:00 lucy kernel: [38532.241445] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfffe0000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531409-04:00 lucy kernel: [38532.241449] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbffd000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531410-04:00 lucy kernel: [38532.241453] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xffff8000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531411-04:00 lucy kernel: [38532.241456] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfffd8000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531412-04:00 lucy kernel: [38532.241460] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xfbbc1000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531412-04:00 lucy kernel: [38532.241460] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
2023-08-01T10:41:23.531413-04:00 lucy kernel: [38532.241464] DMAR: [DMA Write NO_PASID] Request device [02:00.0] fault addr 0xeffc0000 [fault reason 0x05] PTE Write access is not set
2023-08-01T10:41:23.531414-04:00 lucy kernel: [38532.241477] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
2023-08-01T10:41:23.531415-04:00 lucy kernel: [38532.241482] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00040000/00000000
2023-08-01T10:41:23.531415-04:00 lucy kernel: [38532.241487] pcieport 0000:00:02.0:    [18] MalfTLP                (First)
2023-08-01T10:41:23.531416-04:00 lucy kernel: [38532.241492] pcieport 0000:00:02.0: AER:   TLP Header: 00001000 020024ff aaa800c0 00000000
2023-08-01T10:41:23.531417-04:00 lucy kernel: [38532.241500] [drm] PCI error: detected callback, state(2)!!
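
For anyone who wants to read the hex values in the excerpt above, here
is a minimal Python sketch that decodes them. The bit positions used for
VM_CONTEXT1_PROTECTION_FAULT_STATUS are an assumption on my part, picked
because they reproduce the fields the driver itself printed (0x01,
vmid 1, read, 132); they are not taken from the driver source. The AER
table uses the standard PCIe uncorrectable error status bit names. The
function names are hypothetical.

```python
# Hypothetical helpers for decoding values from the kernel log excerpt.
# The VM fault field positions are assumptions that happen to match the
# driver's own summary line; treat them as a sketch, not a reference.

def decode_vm_fault_status(status: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields."""
    return {
        "protections": status & 0xFF,                          # assumed bits 0-7
        "mc_id": (status >> 12) & 0xFF,                        # assumed bits 12-19
        "access": "write" if (status >> 24) & 1 else "read",   # assumed bit 24
        "vmid": (status >> 25) & 0xF,                          # assumed bits 25-28
    }


def decode_mc_client(client: int) -> str:
    """Turn the 32-bit memory client tag into its ASCII name."""
    return client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


# PCIe AER Uncorrectable Error Status register bits.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_status(status: int) -> list:
    """List the names of all error bits set in an AER status value."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    print(decode_vm_fault_status(0x02084001))
    print(decode_mc_client(0x54433700))
    print(decode_aer_status(0x00040000))
    # The "page" in the VM fault line is the fault address register in decimal.
    print(0x060004CA)
```

Run against the values above, this gives vmid 1, a read access, memory
client id 132 ('TC7'), page 100664522, and a single AER bit set (bit 18,
Malformed TLP), which agrees with the "MalfTLP" line in the log. It only
restates what the kernel already reported; it does not say anything
about the cause.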

I've found similar reports online, e.g.:

https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp
https://forums.linuxmint.com/viewtopic.php?t=380748
https://gitlab.freedesktop.org/drm/amd/-/issues/2358

But I'm really not clear whether these represent the same problem, or
are just different variations of a more general driver / firmware
problem. (I'm assuming it's software / firmware, since everything
worked fine previously, although I suppose it's possible that something
physical has broken in the hardware.)

Any ideas?

-- 
Celejar

Follow-Ups:
- Re: AMD GPU hard lockups
  - From: piorunz <piorunz@gmx.com>
