
Testing HIP with the amdgpu driver



Hello,

I wanted to compare the results of my previous tests, which used a Debian userland with an Ubuntu kernel and amdgpu-dkms [1], against a run using a purely Debian userland and kernel. I picked up a spare SSD for my Radeon VII workstation, installed Debian Testing on it and built rocm-hipamd in a Debian Sid docker container. To do this, the only non-free software I needed was firmware-amd-graphics. The kernel version:

    cgmb@scorbunny:~$ uname -a
    Linux scorbunny 5.17.0-1-amd64 #1 SMP PREEMPT Debian 5.17.3-1 (2022-04-18) x86_64 GNU/Linux

The actual package building and testing was done in docker with

    docker run -it --device=/dev/dri --device=/dev/kfd --security-opt seccomp=unconfined --group-add video debian:sid

It was somewhat disheartening that the results were so different using the built-in kernel support rather than the amdgpu-dkms module. When running the HIP test suite, it appeared as if some component of the graphics subsystem had crashed. My X session wasn't entirely destroyed, but it was certainly rendered unusable. The tests, however, did continue to run just as Étienne described.

I suspect that the problem is related to error handling when overloading the GPU work queues. When I logged in on a new tty, I could see from `top` that there were many HIP tests running. It would be nice to limit that, as scaling the parallelism based on `nproc` doesn't make much sense for GPU tests. The dmesg log supports the theory that the GPU is getting overloaded with work and something isn't handling it well. The first message after the hipamd test suite began was

    amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.

That said, it's just a guess. I limited docker to a single core when running the tests and it didn't prevent this error. I have not yet identified which test causes this log message or which test breaks my graphical session, but it shouldn't be that hard to find if I start running tests manually. I expect that it's the same test doing both.
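
A minimal sketch of how that manual search could go, assuming the tests are standalone executables that can be run one at a time and that the kernel log is readable from wherever the script runs (on Debian that may mean running as root on the host rather than inside the container). The names `find_culprit`, `count_hits` and `LOG_CMD` are hypothetical helpers, not part of HIP or ROCm:

```shell
#!/bin/sh
# Hypothetical helper: run each test executable on its own and report the
# first one after which the kernel log gains a new oversubscription message.
PATTERN='Runlist is getting oversubscribed'

# Command used to read the kernel log. It is overridable so the logic can
# be tried without a GPU; the default assumes dmesg is permitted.
LOG_CMD=${LOG_CMD:-dmesg}

count_hits() {
    # Number of log lines matching the pattern so far.
    $LOG_CMD | grep -c "$PATTERN"
}

find_culprit() {
    before=$(count_hits)
    for t in "$@"; do
        # The test's own pass/fail status is ignored; only the log matters.
        "$t" >/dev/null 2>&1
        after=$(count_hits)
        if [ "$after" -gt "$before" ]; then
            echo "$t"
            return 0
        fi
    done
    # No test produced a new message.
    return 1
}
```

It would be called as `find_culprit test1 test2 ...` with the paths of the test executables. Since it runs the tests serially, it would only catch a message triggered by a single test on its own, not one that depends on several tests running at once.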

In better news, I managed to build rocrand and its test suite using the Debian HIP packages. The rocRAND library passed all tests! There are a couple patches still needed for hipcc and hip-config.cmake in rocm-hipamd. I'll follow up with those in a new thread about build-time issues and leave this thread for runtime-related discussions.

[1]: https://lists.debian.org/debian-ai/2022/05/msg00025.html

Sincerely,
Cory Bloor

