
Re: Testing HIP with the amdgpu driver



Good news everyone,

On 2022-05-23 16:08, Cordell Bloor wrote:
> I suspect that the problem is related to error handling when overloading the GPU work queues. [...] That said, it's just a guess. I limited docker to a single core when running the tests and it didn't prevent this error.

Overwhelming the GPU does seem to be the cause of most of the problems I experienced. Obviously, failing a stress test isn't a good thing. However, I'm glad to report that the actual functional test failures are not as bad as they first appeared.

I had updated my kernel to the one in Debian Sid, so I can't rule out the newer kernel contributing to my improved results. However, I experienced exactly the same graphics problems on the newer kernel when I used gbp buildpackage and let dh_auto_test run the tests. It wasn't until I started running ctest directly that I got normal behaviour.

Kernel:
Linux scorbunny 5.17.0-2-amd64 #1 SMP PREEMPT Debian 5.17.6-1 (2022-05-14) x86_64 GNU/Linux

Command:
# ctest --output-on-failure -E "(hipMultiThreadDevice-pyramid|hipMemoryAllocateCoherentDriver|hipMultiProcIpcMem|hipIpcMemAccessTest|hipStreamSync2)"

Results:
99% tests passed, 3 tests failed out of 403

Total Test time (real) = 490.94 sec

The following tests did not run:
     96 - directed_tests/g++/hipMalloc_cxx_amd.tst (Skipped)

The following tests FAILED:
     99 - directed_tests/hiprtc/hiprtcGetLoweredName.tst (SEGFAULT)
    100 - directed_tests/hiprtc/saxpy.tst (SEGFAULT)
    361 - directed_tests/runtimeApi/multiThread/hipMultiThreadStreams2.tst (Subprocess aborted)
Errors while running CTest

Note that I excluded 5 tests in that ctest command, so it's really more like 400 out of 408 tests passing. Still, the built-in amdgpu driver is a lot closer to parity with the amdgpu-dkms module than it first appeared!

> I have not yet identified which test causes this log message or which test breaks my graphical session, but it shouldn't be that hard to find if I start running tests manually.

I suspect it's hipStreamSync2 doing that, because it fails particularly hard. However, none of the tests break my graphical session when I run them one at a time.
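For anyone wanting to repeat the one-at-a-time runs, a small shell loop over the excluded tests is enough. This is a minimal sketch, not part of the HIP test suite: the `run_isolated` function and the `DRY_RUN` variable are hypothetical names, and it relies on `ctest -R` treating its argument as a regular expression matched against test names (so a pattern may select more than one test).

```shell
#!/bin/sh
# Hypothetical helper: run each named test pattern in its own ctest
# invocation so a misbehaving test can be isolated.
# ctest -R takes a regular expression and matches substrings of test names.
# With DRY_RUN=1 the ctest commands are printed instead of executed.
run_isolated() {
  for t in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "ctest --output-on-failure -R $t"
    else
      echo "=== $t ==="
      # Keep going after a failure so every test gets its own run.
      ctest --output-on-failure -R "$t" || echo "FAILED: $t"
    fi
  done
}
```

Usage, from the build directory: `run_isolated hipMultiThreadDevice-pyramid hipMemoryAllocateCoherentDriver hipMultiProcIpcMem hipIpcMemAccessTest hipStreamSync2`. Running each pattern separately means only one test is stressing the GPU at a time, which matches the observation that the graphical session survives single runs.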

Sincerely,
Cory Bloor

