Re: Debian ROCm CI troubles
Hi,
I understand the problem and I wanted to help earlier, but I need to prioritize other projects. Right now, my focus is on a new CI in which we can test using non-free models and libraries, so that anything that Debian ships and needs a model can have autopkgtests, too.
On 2025-08-11 01:52, Cordell Bloor wrote:
> There seems to be something wrong with head node for the ROCm Debian CI [1]. There have been many new uploads, but it doesn't seem to be running jobs for them. I'm also seeing an Internal Server Error when I try to manually request jobs. We would really benefit from having the CI available during the ROCm 5.7 -> 6.1 -> 6.4 and LLVM 17 -> 19/20 updates. I hate to ask anything more from you, but your expertise with this system is unmatched. Do you think you could give it a kick and get it working again?
The internal server error was caused by expired certificates, an oversight from when I last generated them. The new ones were in the common pki/ folder, but I hadn't copied them over to the host's configuration. This is fixed; you can manually schedule tests now.
The other problem is not with debci but with debci-scheduler. There are at least two distinct issues. One is that python-apt complains about a missing deb-src entry for stable-backports when it clearly exists. Another is the state of experimental. I cannot look into these right now; I'll get to them sometime during the week. Work is busy.
> If there are folks on this list that want to lend a hand but aren't sure how to help out with ROCm, then I would suggest that contributing to the DebCI would be greatly beneficial. Aside from fixing the bugs that cause the queues to stall, it would be nice to improve the user interface so that there is more information displayed directly on the website about what the DebCI head node is doing. I'd like to see information about the status of worker nodes, the state of the queues (e.g., jobs in progress), more results visible at a glance (e.g., percentage failed rather than just pass/fail), and a more useful main page. I think a lot of these improvements could be upstreamed into the official DebCI.
>
> We also need to increase the bus factor on the number of individuals with a solid understanding of the ROCm-enhanced DebCI system. Fixing bugs and adding features would be a great way to learn about it.
I'd love that; the reason I spent so much time on documentation [1, 2] is precisely so that I'm not a single point of failure.
But I do think you are underestimating just how much effort this takes. It's not just debci: it's also our autopkgtest fork with the QEMU and podman backends, our scheduler, pkg-rocm-tools, and gpuenv-utils. There is interplay between all of these and this often requires significant work.
And FWIW it's not like I chose to reduce the bus factor to 1 -- I just put hundreds of hours of work into it over the past three years. Hence my puzzlement about [3]. As agreed, I'm giving the new arrangement time, but did hope that it would relieve me of some of the work.
In any case, I do have updates planned, but they will take longer. I had some fantastic discussions at DebConf, with various key teams, surrounding GPU/accelerator adoption in Debian's official policies and tooling. Really fantastic. We have lots of people eager to see Debian be first-mover here. I did not expect so much enthusiasm.
From these discussions, it became clear that many (if not all) of my custom solutions above will eventually be obsoleted by vendor-neutral ones, which is why I've been so reluctant to invest more time in them. I'll be working directly on upstream tools instead, and the ROCm CI (and the new "AI" CI) will eventually be able to use those. It'll just take some more time. But the tools will be better.
In the meantime, as AMD is already partnering with Canonical on ROCm, they seem to be the obvious choice for support. They have lots of debci and autopkgtest experience, and the developers working on this are kind, helpful, and skilled. I believe I mentioned this at DebConf.
Best,
Christian
[1]: https://salsa.debian.org/rocm-team/rocm-team-infra/
[2]: https://salsa.debian.org/rocm-team/rocm-team-infra/-/blob/master/doc/Software-in-our-infra.md?ref_type=heads
[3]: https://lists.debian.org/debian-ai/2025/05/msg00100.html