Three discussion questions on rocm-target-arch
I've been thinking about rocm-target-arch.
(1) I was wondering what happens with rocm-target-arch in downstream
distributions like Mint, Pop!_OS, etc. When Linux Mint 23 rolls around,
am I right to assume that ROCm will fail to build on that platform?
The current Linux Mint 22 codename is Zara, so if we were using
rocm-target-arch at the start of last year, then I think their ROCm
packages would probably see there's no
/usr/share/pkg-rocm-tools/data/build-targets/zara data file and error
out. It's totally reasonable to expect downstream packagers to put in a
bit of work to maintain a distribution, but I worry a bit about breaking
ROCm packages in every downstream distro by default. There will always
be some portion of downstream maintainers who don't fix them, and it's
the users who will lose out.
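
To make the failure mode concrete, here's a rough sketch of the
fallback I float in the summary below: warn and fall back to sid when
the codename has no data file. This is only a sketch of the idea, not
how rocm-target-arch is actually implemented, and everything except the
build-targets path from above is a guess on my part.

    # Hypothetical sketch: warn and fall back to sid when no data file
    # matches the distribution codename.
    import os
    import subprocess
    import sys

    DATA_DIR = "/usr/share/pkg-rocm-tools/data/build-targets"

    def load_targets():
        codename = subprocess.run(
            ["lsb_release", "--short", "--codename"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()  # e.g. "trixie", "zara", "noble"
        path = os.path.join(DATA_DIR, codename)
        if not os.path.exists(path):
            # Instead of erroring out, warn and use the sid target list.
            print(f"W: no build-targets file for '{codename}', "
                  "falling back to sid", file=sys.stderr)
            path = os.path.join(DATA_DIR, "sid")
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]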
I'm also pondering how we update packages to a new target list. In our
plan for enabling gfx1201 on unstable, we first upload the
gfx1201-capable tooling to unstable, then update pkg-rocm-tools to
build for gfx1201, then upload our new packages.
(2) The packages *must* be uploaded in sequence from the bottom of the
dependency tree to the top --- even if there were no API or ABI changes
in some of the lower-level packages --- because rocm-target-arch
defaults to "reduce" mode, which drops any targets that are not found in
all build dependencies. This ordering requirement will also apply if we
ever change the rocm-target-arch list and request binNMUs.
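
For what it's worth, my mental model of reduce mode is a set
intersection over the X-ROCm-GPU-Architecture fields of the
GPU-enhanced build dependencies. The little sketch below (function and
package names invented for illustration) shows why the ordering
matters: if rocBLAS was last built before gfx1201 was added to the
list, a reverse dependency uploaded today silently loses gfx1201.

    # Reduce mode as I understand it: intersect the requested targets
    # with the X-ROCm-GPU-Architecture list of every GPU-enhanced B-D.
    def reduce_targets(requested, build_dep_targets):
        targets = set(requested)
        for dep_targets in build_dep_targets.values():
            targets &= set(dep_targets)
        return sorted(targets)

    requested = ["gfx90a", "gfx1100", "gfx1201"]
    build_deps = {
        # Built before gfx1201 was added to the target list.
        "librocblas0": ["gfx90a", "gfx1100"],
    }
    print(reduce_targets(requested, build_deps))  # ['gfx1100', 'gfx90a']

If that's accurate, then the only way to get gfx1201 everywhere is to
walk the dependency tree bottom-up, exactly as described above.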
I think the main benefit of reduce mode is that if a package includes
some value in the X-ROCm-GPU-Architecture field, then probably all
dependencies also include that value. This would mostly only be false if
a dependency dropped support, or if a package was built against a
dependency on unstable and migrated to testing before the dependency
did. The biggest cost is that it's more difficult to reason about. You
must know the current state of all your GPU-enhanced B-Ds on buildd to
know which targets your package will build for after upload. There are
also a number of tricky cases where this behaviour is incorrect, and I
think they may be more common than expected (e.g., a B-D uses SPIR-V,
generic targets, or HIP RTC, or the calls into the B-D are guarded by
conditionals).
(3) It's a bit annoying that we can't update rocm-target-arch so that it
is ready when new ROCm versions land in unstable. I sort of wonder if
the distribution is the wrong thing to be checking. The compiler version
might be an interesting alternative. You could update pkg-rocm-tools on
unstable right now, saying which targets would be enabled for LLVM 21,
but as long as unstable is using LLVM 17, it would continue using the
LLVM 17 targets. That would sidestep problem (1) as well, as it would be
robust against changes to the distribution name.
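
As a sketch of what keying on the compiler version could look like
(the compiler name, version detection, and table contents below are all
placeholders, not a proposal for the actual lists):

    # Hypothetical sketch: choose targets based on the installed
    # compiler, not the distribution. Table contents are placeholders.
    import re
    import subprocess

    TARGETS_BY_LLVM = {
        17: ["gfx90a", "gfx1030", "gfx1100"],
        21: ["gfx90a", "gfx1030", "gfx1100", "gfx1201"],
    }

    def llvm_major_version(compiler="clang"):
        out = subprocess.run([compiler, "--version"],
                             capture_output=True, text=True,
                             check=True).stdout
        return int(re.search(r"clang version (\d+)", out).group(1))

    def targets_for_installed_compiler():
        version = llvm_major_version()
        try:
            return TARGETS_BY_LLVM[version]
        except KeyError:
            raise SystemExit(f"E: no target list for LLVM {version}")

With something like this, pkg-rocm-tools could already carry the LLVM
21 entry today, and it would only take effect once unstable actually
switches compilers.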
I suppose there's always the possibility that two distributions want to
use the same version of LLVM but have different build targets. There are
a few possibilities there. They could have different versions of
pkg-rocm-tools. Or, we could have an optional qualifier for the
distribution. We can probably solve this one on the fly if it ever happens.
That was a bit long, so I'll end with a brief summary. I welcome your
thoughts on these questions:
(1) Maybe rocm-target-arch could print a warning and default to sid if
it can't identify the distribution?
(2) We should consider defaulting to --no-reduce. It is more similar to
what we're used to when setting targets manually. Reduce mode is a very
impressive feature, but it might be a bit too much all at once?
(3) Perhaps we should consider basing the target architectures on the
compiler version, rather than the target distribution?
Sincerely,
Cory Bloor