Testing HIP with the amdgpu driver
Hello,
I wanted to compare the results of my previous tests using a Debian
userland with an Ubuntu kernel and amdgpu-dkms [1] to one using a purely
Debian userland and kernel. I picked up a spare SSD for my Radeon VII
workstation, installed Debian Testing on it and built rocm-hipamd in a
Debian Sid docker container. To do this, the only non-free software I
needed was firmware-amd-graphics. The kernel version:
cgmb@scorbunny:~$ uname -a
Linux scorbunny 5.17.0-1-amd64 #1 SMP PREEMPT Debian 5.17.3-1 (2022-04-18) x86_64 GNU/Linux
The actual package building and testing was done in docker with
docker run -it --device=/dev/dri --device=/dev/kfd --security-opt seccomp=unconfined --group-add video debian:sid
It was somewhat disheartening that the results were so different using
the built-in kernel support rather than the amdgpu-dkms module. When
running the HIP test suite, it appeared as if some component of the
graphics subsystem had crashed. My X session wasn't entirely destroyed,
but it was certainly rendered unusable. The tests, however, did continue
to run just as Étienne described.
I suspect that the problem is related to error handling when overloading
the GPU work queues. When I logged in on a new tty, I could see from
`top` that there were many HIP tests running. It would be nice to limit
that, as scaling the parallelism based on `nproc` doesn't make much
sense for GPU tests. The dmesg log supports the theory that the GPU is
getting overloaded with work and something isn't handling it well. The
first message after the hipamd test suite began was:
amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
That said, it's just a guess. I limited docker to a single core when
running the tests and it didn't prevent this error. I have not yet
identified which test causes this log message or which test breaks my
graphical session, but it shouldn't be that hard to find if I start
running tests manually. I expect that it's the same test doing both.
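
Running the tests manually could look something like the sketch below. It
assumes each test is a standalone executable passed as an argument and that
the kernel log is readable with dmesg (which may require root). The names
find_offender, count_hits and LOG_CMD are made up for this sketch and are not
part of rocm-hipamd. It runs the tests one at a time and prints the first one
after which a new oversubscription message appears in the log.

```shell
#!/bin/sh
# Hypothetical helper, not part of any ROCm package.
PATTERN='Runlist is getting oversubscribed'
# Assumption: the kernel log can be read with dmesg; override LOG_CMD otherwise.
LOG_CMD=${LOG_CMD:-dmesg}

count_hits() {
    # Print the number of kernel log lines matching the pattern (0 if none).
    $LOG_CMD 2>/dev/null | grep -c "$PATTERN"
}

find_offender() {
    # Run each given test executable serially and print the first one
    # after which the number of matching log lines has increased.
    before=$(count_hits)
    for t in "$@"; do
        "$t" >/dev/null 2>&1    # the test's own pass/fail status is ignored
        after=$(count_hits)
        if [ "$after" -gt "$before" ]; then
            echo "$t"
            return 0
        fi
    done
    return 1                    # no test produced a new message
}
```

After sourcing this, something like `find_offender ./path/to/tests/*` (a
hypothetical path) would print the first offending test, if any. The count can
be misleading if the kernel ring buffer wraps during the run, and a test that
takes down the graphical session is probably best chased from a separate tty
or over ssh.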
In better news, I managed to build rocrand and its test suite using the
Debian HIP packages. The rocRAND library passed all tests! There are a
couple of patches still needed for hipcc and hip-config.cmake in
rocm-hipamd. I'll follow up with those in a new thread about build-time
issues and leave this thread for runtime-related discussions.
[1]: https://lists.debian.org/debian-ai/2022/05/msg00025.html
Sincerely,
Cory Bloor