
Three discussion questions on rocm-target-arch



I've been thinking about rocm-target-arch.

(1) I was wondering what happens with rocm-target-arch in downstream distributions like Mint, Pop!_OS, etc. When Linux Mint 23 rolls around, I assume that ROCm will fail to build on that platform?

The current Linux Mint 22 codename is Zara, so if we had been using rocm-target-arch since the start of last year, I think their ROCm packages would have seen that there's no /usr/share/pkg-rocm-tools/data/build-targets/zara data file and errored out. It's totally reasonable to expect downstream packagers to put in a bit of work to maintain a distribution, but I worry about breaking ROCm packages in every downstream distro by default. There will always be some portion of downstream maintainers who don't fix them, and it's the users who will lose out.
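To make that failure mode concrete, here's a rough sketch in Python of the lookup as I imagine it, together with the warn-and-fall-back-to-sid behaviour I suggest in (1) below. I haven't checked the actual internals of rocm-target-arch, and the one-target-per-line file format is my assumption:

    import os
    import sys

    DATA_DIR = "/usr/share/pkg-rocm-tools/data/build-targets"

    def load_targets(codename):
        """Load the GPU target list for a distribution codename,
        warning and falling back to sid if the codename is unknown."""
        path = os.path.join(DATA_DIR, codename)
        if not os.path.exists(path):
            # As I understand it, this is a hard error today.
            print(f"W: no target data for '{codename}', assuming sid",
                  file=sys.stderr)
            path = os.path.join(DATA_DIR, "sid")
        with open(path) as f:
            # Assumed format: one gfx target per line, '#' comments.
            return [line.strip() for line in f
                    if line.strip() and not line.startswith("#")]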

I'm also pondering how we'll update packages to a new target list. Our plan for enabling gfx1201 on unstable is to upload the tooling needed to build for gfx1201, then update pkg-rocm-tools to add gfx1201 to its target list, then upload our new packages.

(2) The packages *must* be uploaded in sequence from the bottom of the dependency tree to the top --- even if there were no API or ABI changes in some of the lower-level packages --- because rocm-target-arch defaults to "reduce" mode, which drops any targets that are not found in all build dependencies. This ordering requirement will also apply if we ever change the rocm-target-arch list and request binNMUs.

I think the main benefit of reduce mode is that if a package includes some value in the X-ROCm-GPU-Architecture field, then probably all of its dependencies include that value too. That would mostly only fail to hold if a dependency dropped support for a target, or if a package was built against a dependency in unstable and then migrated to testing before the dependency did. The biggest cost is that reduce mode is more difficult to reason about: you must know the current state of all your GPU-enhanced B-Ds on buildd to know which targets your package will build for after upload. There are also a number of tricky cases where this behaviour is incorrect, and I think they may be more common than expected (e.g., a B-D uses SPIR-V, generic targets, or HIP RTC, or calls to the B-D are guarded by conditionals).
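Concretely, reduce mode (as I understand it) amounts to intersecting the requested target list with the X-ROCm-GPU-Architecture field of every GPU-enhanced build dependency. A toy sketch, where the space-separated field format, the dict representation of a B-D, and the gfx1100 example target are my assumptions:

    def reduce_targets(requested, build_deps):
        """Drop any requested target that is not advertised by
        every GPU-enhanced build dependency."""
        targets = set(requested)
        for dep in build_deps:
            advertised = set(dep.get("X-ROCm-GPU-Architecture", "").split())
            targets &= advertised
        return sorted(targets)

    # If one B-D was last built before gfx1201 was enabled:
    # reduce_targets(["gfx1100", "gfx1201"],
    #                [{"X-ROCm-GPU-Architecture": "gfx1100"}])
    # returns ["gfx1100"], silently dropping gfx1201.

That silent dropping is exactly why the upload order in (2) matters.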

(3) It's a bit annoying that we can't update rocm-target-arch so that it is ready when new ROCm versions land in unstable. I sort of wonder if the distribution is the wrong thing to be checking. The compiler version might be an interesting alternative. You could update pkg-rocm-tools on unstable right now, saying which targets would be enabled for LLVM 21, but as long as unstable is using LLVM 17, it would continue using the LLVM 17 targets. That would sidestep problem (1) as well, as it would be robust against changes to the distribution name.
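Sketched out, the idea might look something like this (Python again; the version-to-target table below is entirely made up, and I'm assuming hipcc --version reports the underlying clang version, which I believe it does):

    import re
    import subprocess

    # Hypothetical data shipped by pkg-rocm-tools: target lists keyed
    # by LLVM major version instead of distribution codename.
    TARGETS_BY_LLVM = {
        17: ["gfx1030", "gfx1100"],
        21: ["gfx1030", "gfx1100", "gfx1201"],
    }

    def detect_llvm_major():
        """Ask the default ROCm compiler which LLVM it is built on."""
        out = subprocess.run(["hipcc", "--version"],
                             capture_output=True, text=True).stdout
        return int(re.search(r"clang version (\d+)", out).group(1))

    def current_targets():
        # The LLVM 21 entry stays inert until that compiler
        # actually lands in the archive.
        return TARGETS_BY_LLVM[detect_llvm_major()]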

I suppose there's always the possibility that two distributions want to use the same version of LLVM but have different build targets. There are a few possibilities there. They could have different versions of pkg-rocm-tools. Or, we could have an optional qualifier for the distribution. We can probably solve this one on the fly if it ever happens.
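If it did, the lookup could simply try a distribution-qualified key before falling back to the bare compiler version. A sketch with a hypothetical key scheme:

    def lookup_targets(table, llvm_major, distro=None):
        """Prefer a distribution-qualified entry (e.g. "21.trixie"),
        falling back to the plain LLVM version ("21")."""
        if distro is not None and f"{llvm_major}.{distro}" in table:
            return table[f"{llvm_major}.{distro}"]
        return table[str(llvm_major)]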


That was a bit long, so I'll end with a brief summary. I welcome your thoughts on these questions:

(1) Maybe rocm-target-arch could print a warning and default to sid if it can't identify the distribution?

(2) We should consider defaulting to --no-reduce, which is closer to what we're used to when setting targets manually. Reduce mode is a very impressive feature, but it might be a bit too much all at once?

(3) Perhaps we should consider basing the target architectures on the compiler version, rather than the target distribution?

Sincerely,
Cory Bloor

