Re: ROCm CI workers down due to extreme cold



Hi folks,

The emergency is over and I've brought the CI workers back online.

On 2024-01-13 19:45, Cordell Bloor wrote:
It's rather cold in Calgary at the moment. The forecast calls for a low of -38°C tonight (or -49°C after adjusting for windchill). The government of Alberta has issued an emergency alert, requesting all residents to reduce usage of electric power because the grid is nearing its limits due to reduced capacity and increased demand. I've shut down all the workers that I host for the Debian ROCm CI until such time that they can be safely re-enabled.

Ultimately, it only ever reached -36°C in Calgary, although that is the coldest recorded temperature since 1997. It was much colder in other nearby communities. Edmonton reached -46°C, which was the lowest recorded temperature since they set up the weather station in 1959. There's a great picture of a glass of propane (boiling point: -42°C) on Reddit [1].

In any case, I've turned Lyra and Bootes back on. I've also consolidated the GPUs that are working with the qemu+rocm backend, so Bootes is now hosting gfx1031, gfx1032 and gfx1034. The GPUs that are not working have been removed and Cassiopeia remains shut down. Both changes significantly reduce the idle power draw of the workers I'm hosting.

I'll be using Cassiopeia for some testing to see whether I can get resizable BAR enabled [2] and to debug issues with PCIe pass-through, so we can enable more GPUs on the qemu+rocm backend.

Sincerely,
Cory Bloor

[1]: https://www.reddit.com/r/pics/comments/198q63g/liquid_propane_in_alberta_at_atmospheric_pressure/
[2]: The motherboard is an ROMED8-2T/BCM, if anyone has experience with that.

