Re: ROCm CI workers down due to extreme cold
Hi folks,
The emergency is over and I've brought the CI workers back online.
On 2024-01-13 19:45, Cordell Bloor wrote:
It's rather cold in Calgary at the moment. The forecast calls for a
low of -38°C tonight (or -49°C after adjusting for windchill). The
government of Alberta has issued an emergency alert, requesting all
residents to reduce usage of electric power because the grid is
nearing its limits due to reduced capacity and increased demand. I've
shut down all the workers that I host for the Debian ROCm CI until
such time that they can be safely re-enabled.
Ultimately, it only ever reached -36°C in Calgary, although that is the
coldest recorded temperature since 1997. It was much colder in other
nearby communities. Edmonton reached -46°C, which was the lowest
recorded temperature since they set up the weather station in 1959.
There's a great picture of a glass of propane (boiling point: -42°C) on
Reddit [1].
In any case, I've turned Lyra and Bootes back on. I've also consolidated
the GPUs that are working with the qemu+rocm backend, so Bootes is now
hosting gfx1031, gfx1032 and gfx1034. The GPUs that are not working have
been removed and Cassiopeia remains shut down. These changes both
significantly reduce the idle power draw of the workers I'm hosting.
I'll be using Cassiopeia for some testing to try to see if I can get
resizable BAR enabled [2] and debug issues with PCIe pass-through so we
can enable more GPUs on the qemu+rocm backend.
Sincerely,
Cory Bloor
[1]:
https://www.reddit.com/r/pics/comments/198q63g/liquid_propane_in_alberta_at_atmospheric_pressure/
[2]: The motherboard is an ROMED8-2T/BCM, if anyone has experience with
that.