I have badly misdiagnosed this problem.
The default alarm timeout for tests is 500 seconds. There is a 5 second alarm signal used for the fancy multithreaded logging system in rocblas-test, but that's not the alarm that was triggered. The logs clearly show the test was running for ~500 seconds.
The rocblas-test executable sets a five-second alarm signal before it executes some tests. If the alarm goes off before the test completes, rocblas-test will abort, under the assumption that there was a deadlock that prevented the test from completing.
On slow hosts, such as lyra.rocm.debian.net, the timeout set for the alarm is insufficient to complete the test even when everything is functioning normally. This problem can be observed in the test logs for amd64+gfx900 [1].
The problem wasn't observed on my MI25 test system a few months ago [2]. However, I was wrong in believing this discrepancy was because Lyra is slow. When I ran the tests manually on Lyra in a qemu container, I observed the exact same behaviour, but could see that there was an amdgpu driver timeout that caused a GPU reset. This occurred at exactly the same point in the test suite as on the CI.
My question now is whether this is specific to Lyra or if it applies to all systems with Vega 10 GPUs.
There is no single value that would be appropriate for the alarm timeout on every machine, so the timeout should either be configurable at runtime or entirely removed from the rocblas-test utility.
The timeout can be configured by setting the environment variable ROCBLAS_TEST_TIMEOUT=<seconds> or disabled by setting ROCBLAS_TEST_TIMEOUT=0.
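For illustration, here is a minimal sketch of how an environment-variable override like this could be wired to a POSIX alarm. It is not the actual rocblas-test code: parse_timeout_seconds, arm_test_alarm and the fallback parameter are hypothetical names, and treating malformed values as "use the fallback" is an assumption.

```cpp
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <unistd.h> // alarm() (POSIX)

// Hypothetical helper: turn the text of ROCBLAS_TEST_TIMEOUT into seconds.
// Returns `fallback` when the variable is unset, empty, or not a plain
// non-negative decimal integer. A value of 0 means "no timeout".
unsigned parse_timeout_seconds(const char* value, unsigned fallback)
{
    // Require a leading digit so that "", "-1", "+5" and " 5" are rejected.
    if(value == nullptr || !std::isdigit(static_cast<unsigned char>(value[0])))
        return fallback;

    errno = 0;
    char* end = nullptr;
    const unsigned long parsed = std::strtoul(value, &end, 10);

    // Reject overflow, trailing junk such as "10s", and out-of-range values.
    if(errno != 0 || *end != '\0' || parsed > UINT_MAX)
        return fallback;

    return static_cast<unsigned>(parsed);
}

// Hypothetical helper: arm (or disarm) the alarm before running a test.
// alarm(0) cancels any pending alarm, so a timeout of 0 disables it.
// With the default SIGALRM disposition, an expired alarm ends the process.
void arm_test_alarm(unsigned fallback)
{
    const unsigned seconds
        = parse_timeout_seconds(std::getenv("ROCBLAS_TEST_TIMEOUT"), fallback);
    alarm(seconds);
}
```

With something along these lines, running the tests with ROCBLAS_TEST_TIMEOUT=0 in the environment would leave no alarm armed, while an unset variable would keep whatever default the caller passes in.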
Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/data/autopkgtest/testing/amd64+gfx900/r/rocblas/913/log.gz
[2]: https://slerp.xyz/rocm/logs/full/2023-08-22-gfx900.log