
Re: Towards testing all Navi 3x architectures



Hi Christian,

On 2024-09-24 15:31, Christian Kastner wrote:
There was one unsettling issue: on Monday midday, I noticed that tests
for these two architectures stopped updating on ci.rocm.debian.net, and
when I came home from work, the GPU fans were running at max speed.

I've seen this with many different GPUs. If a card is somehow put into a bad state, it may get stuck in a high-power mode. This is one of the reasons we have health checks and timeouts on our tests. That said, it would be good to keep tightening the timeouts, so that a lack of forward progress in the test suite triggers a timeout sooner and the system can reboot itself to recover after failing the health check.
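
To illustrate what I mean by a forward-progress timeout, here is a rough Python sketch. It is not what our CI scripts actually do: the function name, the status strings, and the use of "no new output" as the definition of "no progress" are all assumptions made for the example. It only covers killing the stuck test; the health check and reboot would still have to happen afterwards.

```python
import queue
import subprocess
import threading
import time


def run_with_progress_timeout(cmd, idle_timeout, total_timeout):
    """Run cmd and watch its output for signs of forward progress.

    Hypothetical helper: kills the process if it prints nothing for
    idle_timeout seconds, or if it runs longer than total_timeout
    seconds overall. Returns (status, output_lines), where status is
    "ok", "failed", "stalled" or "timeout".
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the main thread as it arrives.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: the process closed its output

    threading.Thread(target=pump, daemon=True).start()

    output = []
    deadline = time.monotonic() + total_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            proc.kill()
            proc.wait()
            return "timeout", output
        try:
            # Wait for the next line, but never past the overall deadline.
            line = lines.get(timeout=min(idle_timeout, remaining))
        except queue.Empty:
            proc.kill()
            proc.wait()
            # If the full idle window elapsed, the test made no progress;
            # otherwise the wait was cut short by the overall deadline.
            status = "stalled" if remaining >= idle_timeout else "timeout"
            return status, output
        if line is None:
            return ("ok" if proc.wait() == 0 else "failed"), output
        output.append(line.rstrip("\n"))
```

With something like this, a test that hangs right after starting is reported as "stalled" after idle_timeout seconds instead of holding the worker until the much longer overall limit expires. A test that legitimately computes for a long time without printing anything would be killed too, so the idle limit would have to be chosen per test suite.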

While I do support you in striving towards understanding and addressing these failures, there are many ways a GPU can be put into a bad state where a power cycle is the only fix. That goes double when initializing the GPU in unsupported ways for PCIe passthrough. We might be able to improve the reliability of using passthrough, but we will always have to be prepared for this sort of failure mode.

gfx1102 also completed all its tests on ci-test.rocm.debian.net, but
because of the issue above, and the W7500 being only passively cooled,
I'm not going to move it to endeavour just yet.

The Radeon PRO W7500 has a fan in its picture on the AMD product page [1] and in reviews [2]. Yours doesn't have one?

Sincerely,
Cory Bloor

[1]: https://www.amd.com/en/products/graphics/workstations/radeon-pro/w7500.html
[2]: https://www.pcmag.com/reviews/amd-radeon-pro-w7500

