
First experiments with gfx1100/gfx1101/gfx1102



Hi all,

I ran the first experiments with gfx1100 (via W7800), gfx1101 (via
W7600), gfx1102 (via W7500) over the last few days.

The good
========

gfx1100 seems to work pretty well. If you look at the failing tests in
unstable [1], we have
  * rocprim: fixed in experimental
  * rocfft: fixed in experimental
  * rocsolver, hipsolver: substantial failures, though other arches pass.
    Perhaps these would benefit from an update to 5.7? (They are two of
    the few libraries that are still < 5.7.)
  * hipcub: fails, but so does every other arch except gfx1030
  * rocsparse, hipsparse: fail everywhere
  * rocthrust: fails everywhere
  * rocblas, hipblas: interesting one. They pass everywhere else, except
    for gfx1035 on unstable (testing is still fine)

(We can ignore testing and earlier, as they don't support gfx110n yet.)

The not so good
===============

gfx1100 logs a concerning message to dmesg:

> [drm:amdgpu_ras_eeprom_init [amdgpu]] *ERROR* Failed to read EEPROM table header, res:-5

but seems to work fine otherwise.

It does not work within QEMU, though. I haven't investigated why yet,
and there have been dependency updates since I first tried this.

The bad
=======

gfx1101 fails to load firmware. gfx1102 loads firmware, but tests fail
almost immediately, e.g. test_rocrand_basic:

> HSA exception: Queue create failed at hsaKmtCreateQueue

with a ton of messages logged to dmesg:

> [ 1808.242470] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.242614] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.373242] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.373386] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.501814] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.501946] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.630382] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.630515] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
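
For triaging logs like these, a minimal Python sketch that tallies
amdgpu *ERROR* lines by reporting function may help. The regex is an
assumption inferred from the excerpt above, and tally_amdgpu_errors is a
hypothetical helper name, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the dmesg excerpt above:
#   [ 1808.242470] [drm:some_function [amdgpu]] *ERROR* some message
DRM_ERROR = re.compile(
    r"\[drm:(?P<func>\S+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)"
)


def tally_amdgpu_errors(lines):
    """Count amdgpu *ERROR* lines per (function, message) pair.

    Hypothetical helper: lines that do not match the assumed shape
    are silently skipped.
    """
    counts = Counter()
    for line in lines:
        match = DRM_ERROR.search(line)
        if match:
            key = (match.group("func"), match.group("msg").strip())
            counts[key] += 1
    return counts
```

Passing it the lines of saved dmesg output and iterating over
`.most_common()` on the result shows which messages dominate.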

I then looked at firmware-amd-graphics and noticed that
src:linux-firmware is pretty old, 2023-06-25.

This is almost certainly a factor, at least for gfx1101 and gfx1102,
which were released later, and there is a bug report for this [2].

I think updating src:linux-firmware might be stalled because of [3].
I'll look into this, and see if I can provide an updated version in
at least our archive.

Sadly, I see the same failures as above even with the Feb 2024 firmware
grabbed straight from [4]. So my next attempt will be to try AMD's
upstream repo for newer kernels and firmware; perhaps the fixes haven't
trickled down yet.

Best,
Christian

[1] https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1100
[2] https://bugs.debian.org/1052714
[3] https://bugs.debian.org/1061321
[4] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

