
Re: system freeze



mick crane writes:

hello,
I frequently have the system freeze on me and I have to unplug it.
It seems to happen only in a browser and *appears* to be triggered by using the mouse. When watching a streamed YouTube video or reading blogs, sometimes the screen goes black and everything is unresponsive, and sometimes the screen and everything else freezes but the audio keeps playing.
I'd like it to stop doing that.
It didn't seem to be an issue a while ago, but it now happens at least once per day, with bullseye and now also with bookworm.
I cannot find anything in the logs I have looked at except

[...]

What steps can I take to isolate the problem ?

mick@pumpkin:~$ inxi -SGayz
System:
  Kernel: 5.16.0-6-amd64 arch: x86_64 bits: 64 compiler: gcc v: 11.2.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-5.16.0-6-amd64
    root=UUID=1b68069c-ec94-4f42-a35e-6a845008eac7 ro quiet
  Desktop: Xfce v: 4.16.0 tk: Gtk v: 3.24.24 info: xfce4-panel wm: xfwm
v: 4.16.1 vt: 7 dm: LightDM v: 1.26.0 Distro: Debian GNU/Linux bookworm/sid
Graphics:
  Device-1: AMD Pitcairn LE GL [FirePro W5000] vendor: Dell driver: radeon
    v: kernel alternate: amdgpu pcie: gen: 3 speed: 8 GT/s lanes: 16 ports:
    active: DP-1 empty: DP-2,DVI-I-1 bus-ID: 03:00.0 chip-ID: 1002:6809
    class-ID: 0300
  Display: x11 server: X.Org v: 1.21.1.3 compositor: xfwm v: 4.16.1 driver:
    X: loaded: radeon unloaded: fbdev,modesetting,vesa gpu: radeon
    display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x571mm (40.00x22.48")
    s-diag: 1165mm (45.88")
  Monitor-1: DP-1 mapped: DisplayPort-0 model: LG (GoldStar) HDR 4K
    serial: <filter> built: 2021 res: 3840x2160 hz: 60 dpi: 163 gamma: 1.2
    size: 600x340mm (23.62x13.39") diag: 690mm (27.2") ratio: 16:9 modes:
    max: 3840x2160 min: 640x480
  OpenGL: renderer: AMD PITCAIRN (DRM 2.50.0 5.16.0-6-amd64 LLVM 13.0.1)
    v: 4.5 Mesa 21.3.8 direct render: Yes

[...]

Hello,

I think I had a very similar issue some months ago (Debian Bullseye). Back then I switched to the proprietary AMD driver (?), and that seems to have helped, although on my machine the problem appeared at most once or twice a day.

These were the symptoms I had observed:

* Random conditions (but always GUI application usage)
* Clock in i3bar hangs
* X11 mouse cursor can still move
* Shortly after the hang, screen turns black
* At least one program continues to run despite the
  graphics output being "off"
* SSH connection was not possible during this screen off
  state.

In later instances, I also observed that the screen turned black only temporarily: after a shorter freeze it came back on and the system was usable again.

Here is my output for your inxi command:

$ inxi -SGayz | cat
System:
 Kernel: 5.10.0-13-amd64 x86_64 bits: 64 compiler: gcc v: 10.2.1
 parameters: BOOT_IMAGE=/boot/vmlinuz-5.10.0-13-amd64
 root=UUID=5d6c37b4-341f-4aca-a9f7-2c8a0f39336a ro quiet
 Desktop: i3 4.19.1-non-git info: i3bar, docker dm: startx
 Distro: Debian GNU/Linux 11 (bullseye)
Graphics:
 Device-1: AMD Navi 14 [Radeon Pro W5500] vendor: Dell driver: amdgpu
 v: 5.11.5.21.20 bus ID: 0000:67:00.0 chip ID: 1002:7341 class ID: 0300
 Display: server: X.Org 1.20.11 driver: loaded: amdgpu,ati
 unloaded: fbdev,modesetting,radeon,vesa display ID: :0 screens: 1
 Screen-1: 0 s-res: 7680x1440 s-dpi: 96 s-size: 2032x381mm (80.0x15.0")
 s-diag: 2067mm (81.4")
 Monitor-1: DisplayPort-0 res: 1920x1080 hz: 60 dpi: 93
 size: 527x296mm (20.7x11.7") diag: 604mm (23.8")
 Monitor-2: DisplayPort-1 res: 2560x1440 hz: 60 dpi: 109
 size: 597x336mm (23.5x13.2") diag: 685mm (27")
 Monitor-3: DisplayPort-2 res: 1280x1024 hz: 60 dpi: 96
 size: 338x270mm (13.3x10.6") diag: 433mm (17")
 Monitor-4: DisplayPort-3 res: 1920x1080 hz: 60 dpi: 85
 size: 575x323mm (22.6x12.7") diag: 660mm (26")
 OpenGL: renderer: AMD Radeon Pro W5500
 v: 4.6.14739 Core Profile Context FireGL 21.20 compat-v: 4.6.14739
 direct render: Yes

Back when the problem was still appearing, I could observe the following messages in syslog after a reboot (sorry for the long lines...):

Sep 18 13:11:36 masysma-18 kernel: [ 2045.986736] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=3179, emitted seq=3181
Sep 18 13:11:36 masysma-18 kernel: [ 2045.986935] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 18 13:11:36 masysma-18 kernel: [ 2045.986944] amdgpu 0000:67:00.0: amdgpu: GPU reset begin!
Sep 18 13:11:38 masysma-18 kernel: [ 2047.719111] amdgpu 0000:67:00.0: amdgpu: failed send message: DisallowGfxOff (42) 	param: 0x00000000 response 0xffffffc2
Sep 18 13:11:38 masysma-18 kernel: [ 2047.719114] amdgpu 0000:67:00.0: amdgpu: Failed to disable gfxoff!
Sep 18 13:11:40 masysma-18 kernel: [ 2049.778328] amdgpu 0000:67:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Sep 18 13:11:41 masysma-18 kernel: [ 2051.441397] amdgpu 0000:67:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Sep 18 13:11:42 masysma-18 kernel: [ 2051.634059] amdgpu 0000:67:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 18 13:11:42 masysma-18 kernel: [ 2051.634111] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Sep 18 13:11:42 masysma-18 kernel: [ 2051.813223] amdgpu 0000:67:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 18 13:11:42 masysma-18 kernel: [ 2051.813267] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Sep 18 13:11:43 masysma-18 kernel: [ 2053.354507] amdgpu 0000:67:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Sep 18 13:11:43 masysma-18 kernel: [ 2053.354509] amdgpu 0000:67:00.0: amdgpu: Failed to disable smu features except BACO.
Sep 18 13:11:43 masysma-18 kernel: [ 2053.354511] amdgpu 0000:67:00.0: amdgpu: Fail to disable dpm features!
Sep 18 13:11:43 masysma-18 kernel: [ 2053.354570] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Sep 18 13:11:43 masysma-18 kernel: [ 2053.390519] [drm] free PSP TMR buffer
Sep 18 13:11:43 masysma-18 kernel: [ 2053.423405] amdgpu 0000:67:00.0: amdgpu: BACO reset
Sep 18 13:11:45 masysma-18 kernel: [ 2054.968108] amdgpu 0000:67:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Sep 18 13:11:45 masysma-18 kernel: [ 2054.968110] amdgpu 0000:67:00.0: amdgpu: Failed to enter BACO state!
Sep 18 13:11:45 masysma-18 kernel: [ 2054.968112] amdgpu 0000:67:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:67:00.0
Sep 18 13:11:45 masysma-18 kernel: [ 2054.968154] amdgpu 0000:67:00.0: amdgpu: GPU reset(1) failed
Sep 18 13:11:45 masysma-18 kernel: [ 2054.989004] amdgpu 0000:67:00.0: amdgpu: GPU reset end with ret = -62
Sep 18 13:11:55 masysma-18 kernel: [ 2065.186711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=3181, emitted seq=3181
Sep 18 13:11:55 masysma-18 kernel: [ 2065.186910] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 18 13:11:55 masysma-18 kernel: [ 2065.186919] amdgpu 0000:67:00.0: amdgpu: GPU reset begin!
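
In case it helps with isolating the problem, here is a minimal sketch of how one could scan a saved log for the same kind of messages. It only looks for three fragments that appear in the excerpt above (ring timeout, "GPU reset begin!" and "GPU reset(N) failed"); the function name `summarize`, the pattern names and the log path on the command line are my own assumptions for illustration, and the radeon driver may well word its messages differently.

```python
import re
import sys
from collections import Counter

# Kernel lines in syslog look like "... kernel: [ 2045.986736] <message>".
KERNEL_LINE = re.compile(r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$")

# Fragments copied from the amdgpu excerpt above; the names are hypothetical.
PATTERNS = {
    "ring_timeout": re.compile(r"\*ERROR\* ring (?P<ring>\S+) timeout"),
    "reset_begin": re.compile(r"GPU reset begin!"),
    "reset_failed": re.compile(r"GPU reset\(\d+\) failed"),
}


def summarize(lines):
    """Count GPU hang related kernel messages in an iterable of log lines."""
    counts = Counter()
    rings = Counter()
    first_uptime = None  # kernel uptime (seconds) of the first matching message
    for line in lines:
        kernel = KERNEL_LINE.search(line)
        if kernel is None:
            continue  # not a kernel message with an uptime stamp
        for name, pattern in PATTERNS.items():
            hit = pattern.search(kernel.group("msg"))
            if hit is None:
                continue
            counts[name] += 1
            if name == "ring_timeout":
                rings[hit.group("ring")] += 1
            if first_uptime is None:
                first_uptime = float(kernel.group("uptime"))
    return {
        "counts": dict(counts),
        "rings": dict(rings),
        "first_uptime": first_uptime,
    }


if __name__ == "__main__":
    # The log file to scan is passed as the first argument (an assumption;
    # which file holds kernel messages depends on the local logging setup).
    with open(sys.argv[1], errors="replace") as log:
        print(summarize(log))
```

If the counts stay at zero across a freeze, the hang is probably not being reported by the GPU driver in this form, and it would be worth looking elsewhere.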

I waded through <https://gitlab.freedesktop.org/drm/amd/-/issues/892> to gather ideas about how to fix it. Most of the "solutions" seemed very hacky, though, and I did not try them thoroughly.

This is what I noted from installing the proprietary driver:

https://www.amd.com/en/support/professional-graphics/radeon-pro/radeon-pro-w5000-series/radeon-pro-w5500
./amdgpu-pro-install
rm /etc/X11/xorg.conf

Also, I noted that I had a DP connectivity issue that was possibly responsible for the cases where the monitor would come back with a picture after the hang. It was fixed by re-attaching the DP cable...

HTH and YMMV
Linux-Fan
