
Concerns about ROCm QA



Hi,

I would like to raise some concerns with the state of the ROCm stack.
The current state suggests that things are being rushed towards
6.4/7.0, forgoing our usual planning and with insufficient QA.

For example:

 * The rocm-hipamd transition is still ongoing [1], despite it being a
   relatively small transition.

   Transitions should be kept as short as possible, and we have ways of
   ensuring this. We rebuild and test in experimental, and only when
   reverse dependencies have been fixed (within reason) do we execute
   the transition.

 * rocm-llvm is blocked by regressions and an RC bug [2].

 * rocr-runtime is blocked by regressions and an RC bug [3].

 * rocm-llvm and rocr-runtime additionally block rocm-hipamd.

 * During summer, I uploaded many 6.4 libraries to experimental, which
   all passed or mostly passed their tests. With the updates of the
   non-library components to 6.4, things started to break. Aligning
   everything to 6.4 seems to have fixed this, but the breakage suggests
   that at least some packages need either Breaks or Build-Depends
   changes. See these rocrand tests [4,5], for example.

 * Most importantly, our CI [6] results are not getting attention. We
   have numerous packages failing their tests. For officially supported
   GPU architectures, at least, these failures need to be triaged [7],
   reported as RC bugs, and migration to testing prevented until they
   are addressed, just as would happen in the official CI.
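
For illustration only, the Breaks or Build-Depends tightening
mentioned in the rocrand point above might look roughly like this in
debian/control. The package names and version bounds below are
placeholders I made up for the sketch, not an actual proposal:

```
# Hypothetical debian/control excerpts; names and versions are
# placeholders, not concrete fixes.

# A binary package declaring it breaks libraries from the older series:
Package: librocrand1
Breaks: librocsparse1 (<< 6.4~)

# A source package pinning its build toolchain to the matching series:
Source: rocrand
Build-Depends: rocm-cmake (>= 6.4~),
               rocm-cmake (<< 6.5~)
```

Either relation would have kept mixed 6.x installations or builds from
silently producing the kind of test breakage seen in [4,5].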

One of the goals of building our CI was to demonstrate that ROCm on
Debian works, and with a variety of AMD GPUs. In the bootstrapping
phase, we accepted a certain degree of problems as a consequence of the
experimental nature of the effort.

It's now been more than two years since launch, and the CI has
stabilized. If we still aren't doing anything with the results, then we
have failed the aforementioned goal. Worse, by not addressing errors, we
might even be making an argument against ROCm.

I haven't been active over the past few months because they have been
the busiest of my professional career. But observing passively, the
breakages and failing tests suggest that a release is being prioritized
over quality. In practice, the 6.4 updates have put my package ggml in
a position where I need to drop the HIP backend for the time being,
because ggml's migration to testing would otherwise be blocked for too
long by the issues itemized above.

I would help with this, but those of you who have been following along
over the past few years know how much time I have already invested in
the CI, the packaging and much of our tooling, and it has become
difficult for me to justify investing more.

Half a year ago, AMD committed itself to in-box support on Ubuntu, in a
drive that was described as an "enormous win for Debian" [8]. Yet, over
this past half year, it seems that all advancements, including those on
Ubuntu, have continued to come from the Debian side, and much of that
from volunteers.

To get ROCm on Debian/Ubuntu where it needs to be, I believe we need
better coordination and QA again, especially when we actually see
breakage. And if resources are a problem, then I think AMD needs to
follow through on its commitment to in-box support of ROCm.

Best,
Christian

PS: This criticism of the current state is by no means a criticism of
the work that the contributors have been doing. On the contrary: the
fact that we've gotten Debian and Ubuntu this far is a glowing
testament to their work and commitment. But we can only do so much.

[1]: https://release.debian.org/transitions/html/auto-rocm-hipamd.html
[2]: https://tracker.debian.org/pkg/rocm-llvm
[3]: https://tracker.debian.org/pkg/rocr-runtime
[4]: https://ci.rocm.debian.net/packages/r/rocrand/unstable/amd64+gfx1030/80767/
[5]: https://ci.rocm.debian.net/packages/r/rocrand/unstable/amd64+gfx1030/103533/
[6]: https://ci.rocm.debian.net/
[7]: https://lists.debian.org/debian-ai/2025/08/msg00137.html
[8]: https://lists.debian.org/debian-ai/2025/05/msg00100.html

