Bug#1056171: librocblas0-tests: rocblas-test alarm timeout on slow hosts

To: 1056171@bugs.debian.org
Subject: Bug#1056171: librocblas0-tests: rocblas-test alarm timeout on slow hosts
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Sun, 19 Nov 2023 01:43:37 -0700
Message-id: <f2304c95-9cba-45a7-98ae-abf9b77e593e@slerp.xyz>
Reply-to: Cordell Bloor <cgmb@slerp.xyz>, 1056171@bugs.debian.org
In-reply-to: <170029396605.4576.15920190241514720329.reportbug@8546943794cb>
References: <170029396605.4576.15920190241514720329.reportbug@8546943794cb>

I have badly misdiagnosed this problem.

On 2023-11-18 00:52, Cordell Bloor wrote:

> The rocblas-test executable sets a five-second alarm signal before it
> executes some tests. If the alarm goes off before the test completes,
> rocblas-test will abort, under the assumption that there was a deadlock
> that prevented the test from completing.

The default alarm timeout for tests is 500 seconds. There is also a 5-second alarm used by the fancy multithreaded logging system in rocblas-test, but that is not the alarm that was triggered. The logs clearly show that the test had been running for ~500 seconds.

> On slow hosts, such as lyra.rocm.debian.net, the timeout set for the
> alarm is insufficient to complete the test even when everything is
> functioning normally. This problem can be observed in the test logs for
> amd64+gfx900 [1].

The problem wasn't observed on my MI25 test system a few months ago [2]. However, I was wrong to believe that this discrepancy was because Lyra is slow. When I ran the tests manually on Lyra in a qemu container, I observed exactly the same behaviour, and could also see that an amdgpu driver timeout had caused a GPU reset. This occurred at exactly the same point in the test suite as on the CI.

My question is now whether this is specific to Lyra or if it applies to all systems with Vega 10 GPUs.

> There is no single value that would be appropriate for the alarm timeout
> on every machine, so the timeout should either be configurable at
> runtime or entirely removed from the rocblas-test utility.

The timeout can be configured by setting the environment variable ROCBLAS_TEST_TIMEOUT=<seconds> or disabled by setting ROCBLAS_TEST_TIMEOUT=0.
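
For illustration only, here is a minimal Python sketch of an alarm-guarded test whose timeout is read from the environment. It is not the rocblas-test implementation: the names read_timeout, run_with_alarm and TestTimeout are hypothetical, and the sketch raises an exception where rocblas-test would abort. Only the ROCBLAS_TEST_TIMEOUT variable, the meaning of 0, and the 500-second default are taken from the description above. It relies on SIGALRM, so it assumes a Unix-like host.

```python
import os
import signal

# Default taken from this message; everything else here is an assumption.
DEFAULT_TIMEOUT_SECONDS = 500


class TestTimeout(Exception):
    """Raised when the guarded test outlives its alarm (hypothetical name)."""


def read_timeout(environ=os.environ):
    """Return the alarm timeout in seconds; 0 means the alarm is disabled."""
    raw = environ.get("ROCBLAS_TEST_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    value = int(raw)
    if value < 0:
        raise ValueError("timeout must be a non-negative number of seconds")
    return value


def _on_alarm(signum, frame):
    # rocblas-test aborts at this point; the sketch raises instead.
    raise TestTimeout("test did not finish before the alarm fired")


def run_with_alarm(test, timeout):
    """Run test() under a SIGALRM-based timeout; timeout == 0 disables it."""
    if timeout == 0:
        return test()
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        return test()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

With this structure, a GPU reset that stalls a test looks the same to the alarm as a deadlock or a slow host would: the handler only knows that the time limit passed, not why.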

Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/data/autopkgtest/testing/amd64+gfx900/r/rocblas/913/log.gz

[2]: https://slerp.xyz/rocm/logs/full/2023-08-22-gfx900.log
