First experiments with gfx1100/gfx1101/gfx1102
Hi all,
Over the last few days, I ran first experiments with gfx1100 (via
W7800), gfx1101 (via W7600), and gfx1102 (via W7500).
The good
========
gfx1100 seems to work pretty well. If you look at the failing tests in
unstable [1], we have
* rocprim: fixed in experimental
* rocfft: fixed in experimental
* rocsolver, hipsolver: substantial failures, though other arches pass.
  Perhaps these would benefit from an update to 5.7? (These are two of
  the few libraries that are < 5.7.)
* hipcub: fails, but so does every other arch that is not gfx1030
* rocsparse, hipsparse: fails everywhere
* rocthrust: fails everywhere
* rocblas, hipblas: interesting one. They pass everywhere else, except
  on gfx1035 in unstable (testing is still fine)
(We can ignore testing and earlier, as they don't support gfx110n yet.)
The not so good
===============
gfx1100 logs a concerning message to dmesg
> [drm:amdgpu_ras_eeprom_init [amdgpu]] *ERROR* Failed to read EEPROM table header, res:-5
but seems to work fine otherwise.
It does not work within QEMU, though. I haven't investigated why yet,
and there have been dependency updates since I first tried this.
The bad
=======
gfx1101 fails to load firmware. gfx1102 loads firmware, but tests
basically insta-fail, e.g. test_rocrand_basic:
> HSA exception: Queue create failed at hsaKmtCreateQueue
with a ton of messages logged to dmesg:
> [ 1808.242470] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.242614] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.373242] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.373386] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.501814] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.501946] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
> [ 1808.630382] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
> [ 1808.630515] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
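Since the same two errors repeat many times, it can help to collapse the
log into counts per distinct message. The following is only a minimal
sketch for doing that, not part of any ROCm or Debian tooling: the
function name summarize_amdgpu_errors and the regular expression are my
own assumptions, written against the line format quoted above.

```python
import re
from collections import Counter

# Assumed line format, based on the messages quoted above:
#   [ 1808.242470] [drm:some_function [amdgpu]] *ERROR* some message
# The leading timestamp is optional, since dmesg may be run without it.
LINE_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s+)?"
    r"\[drm:(?P<func>\S+)\s+\[amdgpu\]\]\s+"
    r"\*ERROR\*\s+(?P<msg>.*)$"
)


def summarize_amdgpu_errors(lines):
    """Count distinct (function, message) pairs among amdgpu *ERROR* lines.

    Hypothetical helper for illustration; lines that do not match the
    assumed format are silently skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Tolerate a leading ">" quote marker, as used in this email.
        if line.startswith(">"):
            line = line[1:].lstrip()
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("func"), match.group("msg"))] += 1
    return counts


def format_summary(counts):
    """Render the counts as text lines, most frequent message first."""
    return [
        f"{n:5d}  {func}: {msg}"
        for (func, msg), n in counts.most_common()
    ]
```

Passing the eight lines quoted above to summarize_amdgpu_errors gives a
count of four for each of the two messages. For a live system, the
input could be the lines of the dmesg output read from sys.stdin.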
I then looked at firmware-amd-graphics and noticed that
src:linux-firmware is pretty old, 2023-06-25.
This is almost certainly a factor for at least gfx1101 and gfx1102,
which were released later, and there is a bug report for this [2].
I think updating src:linux-firmware might be stalled because of [3].
I'll look into this and see if I can provide an updated version, at
least in our archive.
Sadly, I see the same failures as above even with the Feb 2024
firmware taken straight from [4]. So my next attempt will be to try
AMD's upstream repo for newer kernels and firmware; perhaps the fixes
haven't trickled down yet.
Best,
Christian
[1] https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1100
[2] https://bugs.debian.org/1052714
[3] https://bugs.debian.org/1061321
[4] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/