
Preparing Cassiopeia and Bootes for the CI



Hello everyone,

I've added two new servers as workers for the Debian ROCm CI. The tl;dr is that there is now automated testing for gfx1031 and gfx1032 hardware and the failures are indicative of actual problems [1][2].
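As an aside, the status links in [1] and [2] differ only in the architecture name, so the same page can be reached for other architectures once their workers are enabled. The snippet below is a minimal sketch of how those URLs are put together. The `failing_url` helper is a hypothetical name used for illustration; the base URL and the `arch[]` query key are taken from the links themselves.

```python
from urllib.parse import urlencode

# Base of the status links [1] and [2].
BASE = "https://ci.rocm.debian.net/status/failing/"


def failing_url(gfx: str) -> str:
    """Build the 'failing packages' link for one GPU architecture.

    Hypothetical helper: the 'arch[]' key and the 'amd64+<gfx>' value
    are copied from the pattern seen in links [1] and [2].
    """
    # urlencode percent-encodes '[', ']' and '+' as %5B, %5D and %2B.
    return BASE + "?" + urlencode({"arch[]": f"amd64+{gfx}"})


if __name__ == "__main__":
    for gfx in ("gfx1031", "gfx1032"):
        print(failing_url(gfx))
```

Running it prints the two URLs listed as [1] and [2] at the end of this message.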

Cassiopeia is an AMD EPYC 7302P (Zen 2, 16 cores @ 3.0 GHz) machine with 128 GB of RAM and a 1000 W PSU. It is currently outfitted with 1x Radeon Pro W5700 (Navi 10; gfx1010), 1x Radeon Pro W5500 (Navi 14; gfx1012), 1x Radeon Pro W6600 (Navi 23; gfx1032) and 1x Radeon Pro WX 7100 (Ellesmere / Polaris 10; gfx803). The Radeon Pro WX 7100 will eventually be replaced with an XFX BC-160 (Navi 12; gfx1011), but for now I need the WX 7100 to test a driver fix [3]. Cassiopeia draws about 150 W at idle.

Bootes is also an AMD EPYC 7302P (Zen 2, 16 cores @ 3.0 GHz) machine with 128 GB of RAM, but it has a 1600 W PSU. Bootes is a 4U server rather than a 3U like Cassiopeia, so it can fit the larger power supply. It is currently outfitted with 1x MI60 (Vega 20; gfx906), 1x MI100 (Arcturus; gfx908), and 1x RX 6700 XT (Navi 22; gfx1031). Those GPUs will probably move around as I bring more servers online.

I've included some photos of these systems on my website [4][5][6]. There was a Sapphire RX 5700 XT (Navi 10; gfx1010) in Bootes when I took the photo, but it has since been removed. I had believed the tight packing would be acceptable because the average GPU utilization while running tests is less than 1%. On my open-air test bench, the GPU fans rarely even spin. The heatsink fins were aligned with the case airflow and I expected that to be enough. However, I found that if a test causes the driver to crash, the GPU can get stuck at 100% power indefinitely. I will add a fourth GPU to Bootes, but it will be a workstation or server GPU with a cooling solution better suited to tight packing.

Christian and I have been debugging issues with PCIe reset, which only seems to be working properly for gfx900 and gfx103x GPUs. So, gfx1031 and gfx1032 can use qemu workers, but we will probably be setting up LXC-based workers to get the rest of the architectures enabled. Using container-based workers will limit our ability to detect driver-related problems, so moving all workers to qemu is a long-term goal. For the moment, only gfx1031 and gfx1032 workers are enabled.
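To make the split concrete, here is a small sketch of the decision described above. It is only an illustration under stated assumptions, not the actual CI configuration: the `worker_backend` function and the two constants are hypothetical names, the patterns encode the observation that PCIe reset currently seems to work only on gfx900 and gfx103x, and the LXC fallback reflects a plan that is still only probable.

```python
from fnmatch import fnmatch

# Assumption for illustration: architectures where PCIe reset currently
# seems to work, written as shell-style patterns ('?' = any one character).
QEMU_CAPABLE = ("gfx900", "gfx103?")

# Workers that are enabled at the moment.
ENABLED = {"gfx1031", "gfx1032"}


def worker_backend(gfx: str) -> str:
    """Return the backend an architecture would use (hypothetical helper)."""
    if any(fnmatch(gfx, pattern) for pattern in QEMU_CAPABLE):
        return "qemu"
    # Everything else would fall back to a container-based worker.
    return "lxc"


if __name__ == "__main__":
    for gfx in ("gfx803", "gfx906", "gfx908", "gfx1010", "gfx1031", "gfx1032"):
        state = "enabled" if gfx in ENABLED else "not yet enabled"
        print(f"{gfx}: {worker_backend(gfx)} ({state})")
```

The fallback branch is the part that would disappear if PCIe reset is fixed for the remaining architectures, since the long-term goal is to run every worker under qemu.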

Cassiopeia and Bootes were funded by the Debian project, which paid for everything aside from the GPUs. AMD provided the Radeon Pro W6600 and I provided the rest of the GPUs.

Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1031
[2]: https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1032
[3]: https://gitlab.freedesktop.org/drm/amd/-/issues/2956
[4]: https://slerp.xyz/img/misc/cassiopeia-open.jpg
[5]: https://slerp.xyz/img/misc/argo-lyra-cassiopeia.jpg
[6]: https://slerp.xyz/img/misc/bootes-open.jpg

