
Re: Three discussion questions on rocm-target-arch



Hi Cory,

On 2025-09-29 10:25, Cordell Bloor wrote:
> (1) Maybe rocm-target-arch could print a warning and default to sid if
> it can't identify the distribution?

This makes sense. My concern was that we cannot know what downstreams
want to support, but that doesn't mean we can't ship a reasonable default.

I've pushed a change that implements this.
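
To illustrate the fallback behaviour, here is a minimal Python sketch. It
is not the actual rocm-target-arch code: the function name targets_for,
the TARGETS table, and the target lists in it are placeholders made up
for this example.

```python
import sys

FALLBACK = "sid"

# Placeholder data for illustration only; these are NOT the real
# per-distribution target lists shipped in pkg-rocm-tools.
TARGETS = {
    "sid": ["gfx900", "gfx1030", "gfx1100"],
    "trixie": ["gfx900", "gfx1030"],
}


def targets_for(distribution, table=TARGETS, fallback=FALLBACK, stream=sys.stderr):
    """Return the target list for a distribution.

    If the distribution is unknown, print a warning to `stream` and
    use the fallback distribution's targets instead.
    """
    if distribution not in table:
        print(
            f"warning: unknown distribution {distribution!r}, "
            f"defaulting to {fallback}",
            file=stream,
        )
        distribution = fallback
    # Return a copy so callers cannot modify the table by accident.
    return list(table[distribution])
```

A downstream with an unrecognized distribution name would then get the
sid targets plus a warning on stderr, rather than a hard failure.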

> (2) We should consider defaulting to --no-reduce. It is more similar to
> what we're used to when setting targets manually. It's a very impressive
> feature, but it might be a bit too much all at once?

According to codesearch.debian.net, nothing was using this yet, so I
pushed this change as well.

> (3) It's a bit annoying that we can't update rocm-target-arch so that it is ready when new ROCm versions land in unstable. I sort of wonder if the distribution is the wrong thing to be checking. The compiler version might be an interesting alternative. You could update pkg-rocm-tools on unstable right now, saying which targets would be enabled for LLVM 21, but as long as unstable is using LLVM 17, it would continue using the LLVM 17 targets. That would sidestep problem (1) as well, as it would be robust against changes to the distribution name.
> 
> I suppose there's always the possibility that two distributions want to use the same version of LLVM but have different build targets. There are few possibilities there. They could have different versions of pkg-rocm-tools. Or, we could have an optional qualifier for the distribution. We can probably solve this one on the fly if it ever happens. 

> (3) Perhaps we should consider basing the target architectures on the
> compiler version, rather than the target distribution?

This is a more complex problem. I still think keying on the distribution
is the correct approach, because it is unambiguous, and because there is
a simple and practical strategy for working with it.

We don't really need to check what the default compiler is; we just need
to update pkg-rocm-tools when the default compiler changes.

We also don't need to touch packages in order to add new architectures.
We just grep the Packages files for X-Rocm-Gpu-Architecture [1, 2] and
trigger binNMUs for all packages that need an update. This is a
zero-change, zero-upload, zero-interaction task, even for packages
maintained by others, that can be done in a few minutes.

This is also safe, in that we would enable support for new architectures
only after all of the default tooling supports them.


Case in counter-point: we recently added gfx1201 to the supported
architecture list. I didn't think this would be an issue, but rocrand
now FTBFS in experimental because of it:

> clang++-17: error: invalid target ID 'gfx1201'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')
> clang++-17clang++-17clang++-17: : : error: error: error: invalid target ID 'gfx1201'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')invalid target ID 'gfx1201'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')invalid target ID 'gfx1201'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')

The build works with hipcc 6.4, but that is not what the package depends
on, so the default 5.7 is used.

This package would FTBFS on a buildd. This will only get resolved once
hipcc 6.4, which depends on LLVM 20, enters unstable.

Until that happens, I think we need to revert gfx1201 support from
experimental.

Best,
Christian

[1]: Unfortunately, dpkg-genchanges case-mangles the field name.

[2]: I only now see that very few packages have this field set, even
     though I added it to many.

     I can't reproduce this right now. When I rebuild rocrand, it gets
     set correctly.

     It may be the case that some ~exp version of rocm-target-arch,
     which I used back in summer to build these packages, was still
     buggy, in which case the issue should be resolved on next uploads.

     It worked fine on recent ggml, for example:

> root@d8ef63823590:/# apt-cache show libggml0-backend-hip | grep -i x-rocm
> X-Rocm-Gpu-Architecture: gfx803 gfx900 gfx906 gfx908 gfx90a gfx1010 gfx1030 gfx1100 gfx1101 gfx1102

