Preparing Cassiopeia and Bootes for the CI
Hello everyone,
I've added two new servers as workers for the Debian ROCm CI. The tl;dr
is that there is now automated testing for gfx1031 and gfx1032 hardware
and the failures are indicative of actual problems [1][2].
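The links in [1] and [2] share one query format, so a status URL for a given architecture can be built rather than typed by hand. This is only a sketch: `failing_url` is a hypothetical helper, and whether architectures other than gfx1031 and gfx1032 will use the same `amd64+gfxNNNN` naming once enabled is an assumption.

```python
from urllib.parse import urlencode

# Base path taken from the links [1] and [2] in this email.
BASE = "https://ci.rocm.debian.net/status/failing/"


def failing_url(gfx_arch: str, deb_arch: str = "amd64") -> str:
    """Build the 'failing packages' URL for one GPU architecture.

    Hypothetical helper: assumes the arch[] value is always
    '<debian arch>+<gfx arch>', as in the two known links.
    """
    query = urlencode({"arch[]": f"{deb_arch}+{gfx_arch}"})
    return f"{BASE}?{query}"


if __name__ == "__main__":
    for arch in ("gfx1031", "gfx1032"):
        print(failing_url(arch))
```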
Cassiopeia is an AMD EPYC 7302P (Zen 2, 16 cores @ 3.0 GHz) machine with
128 GB of RAM and a 1000 W PSU. It is currently outfitted with 1x Radeon
Pro W5700 (Navi 10; gfx1010), 1x Radeon Pro W5500 (Navi 14; gfx1012), 1x
Radeon Pro W6600 (Navi 23; gfx1032) and 1x Radeon Pro WX 7100 (Ellesmere
/ Polaris 10; gfx803). The Radeon Pro WX 7100 will be replaced with an
XFX BC-160 (Navi 12; gfx1011), but for now I need that card to test a
driver fix [3]. Cassiopeia draws about 150 W at idle.
Bootes is also an AMD EPYC 7302P (Zen 2, 16 cores @ 3.0 GHz) machine with
128 GB of RAM, but it has a 1600 W PSU. It can fit a larger power supply
as it is a 4U server, rather than a 3U like Cassiopeia. It is currently
outfitted with 1x MI60 (Vega 20; gfx906), 1x MI100 (Arcturus; gfx908),
and 1x RX 6700 XT (Navi 22; gfx1031). Those GPUs will probably move
around as I bring more servers online.
I've included some photos of these systems on my website [4][5][6].
There was a Sapphire RX 5700 XT (Navi 10; gfx1010) in Bootes when I took
the photo, but it has since been removed. I'd believed the tight packing
to be acceptable because the average GPU utilization while running tests
is less than 1%. On my open air test bench, the GPU fans rarely even
spin. The heatsink fins were aligned with the case airflow and I
expected that to be enough. However, I found that if a test causes the
driver to crash, the GPU may get stuck drawing 100% power indefinitely. I
will add a fourth GPU to Bootes, but it will be a workstation or server
GPU with a more suitable cooling solution for a tight packing.
Christian and I have been debugging issues with PCIe reset, which only
seems to be working properly for gfx900 and gfx103x GPUs. So, gfx1031
and gfx1032 can use qemu workers, but we will probably be setting up
LXC-based workers to get the rest of the architectures enabled. Using
container-based workers will limit our ability to detect driver-related
problems, so moving all workers to qemu is a long-term goal. For the
moment, only gfx1031 and gfx1032 workers are enabled.
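As a rough illustration of that split, the choice of worker type per architecture could be written down as below. This is a sketch under stated assumptions, not the CI's actual configuration: `worker_backend` is a hypothetical name, and the rule simply encodes what is said above (PCIe reset appears to work for gfx900 and gfx103x, so those can use qemu; everything else falls back to LXC).

```python
import re

# Assumption: PCIe reset is treated as reliable only for the architectures
# named in this email, i.e. gfx900 and the gfx103x family.
RESET_OK = re.compile(r"^gfx(900|103[0-9])$")


def worker_backend(gfx_arch: str) -> str:
    """Return 'qemu' where PCIe reset is believed to work, else 'lxc'.

    Hypothetical helper for illustration only.
    """
    return "qemu" if RESET_OK.match(gfx_arch) else "lxc"


if __name__ == "__main__":
    # Architectures of the GPUs currently installed in Cassiopeia and Bootes.
    for arch in ("gfx803", "gfx906", "gfx908", "gfx1010", "gfx1012",
                 "gfx1031", "gfx1032"):
        print(f"{arch}: {worker_backend(arch)}")
```

If PCIe reset gets fixed for more architectures, only the pattern would need to change.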
Cassiopeia and Bootes were funded by the Debian project, which paid for
everything aside from the GPUs. AMD provided the Radeon Pro W6600 and I
provided the rest of the GPUs.
Sincerely,
Cory Bloor
[1]: https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1031
[2]: https://ci.rocm.debian.net/status/failing/?arch%5B%5D=amd64%2Bgfx1032
[3]: https://gitlab.freedesktop.org/drm/amd/-/issues/2956
[4]: https://slerp.xyz/img/misc/cassiopeia-open.jpg
[5]: https://slerp.xyz/img/misc/argo-lyra-cassiopeia.jpg
[6]: https://slerp.xyz/img/misc/bootes-open.jpg