Hi Christian,
> There was one unsettling issue: on Monday midday, I noticed that tests for these two architectures stopped updating on ci.rocm.debian.net, and when I came home from work, the GPU fans were running at max speed.
I've seen this with many different GPUs. If the card is somehow put into a bad state, it may get stuck in a high power mode. This is one of the reasons why we have the health checks and timeouts on our tests. With that said, it would be good to continue to enhance the timeouts, so that a lack of forward progress in the test suite causes a timeout sooner and the system can reboot itself to recover after failing the health check.
While I do support you in striving towards understanding and
addressing these failures, there are many ways a GPU can be put
into a bad state where a power cycle is the only fix. That goes
double when initializing the GPU in unsupported ways for PCIe
passthrough. We might be able to improve the reliability of
using passthrough, but we will always have to be prepared for
this sort of failure mode.
> gfx1102 also completed all its tests on ci-test.rocm.debian.net, but because of the issue above, and the W7500 being only passively cooled, I'm not going to move it to endeavour just yet.
The Radeon PRO W7500 has a fan in its picture on the AMD product
page [1] and in reviews [2]. Yours doesn't have one?
Sincerely,
Cory Bloor
[1]: https://www.amd.com/en/products/graphics/workstations/radeon-pro/w7500.html
[2]: https://www.pcmag.com/reviews/amd-radeon-pro-w7500