Re: ROCm GPU_TARGETS and GPU_ARCHS and some other points
Hi Christian,
On 2025-10-16 07:10, Christian BAYLE wrote:
> I'm currently working on the composable-kernel [1] package and have some
> questions about GPU_TARGETS and GPU_ARCHS, which are used to build the
> libraries [....]
> I could do several builds on a per-arch basis, which has the good
> property of building tests and examples, but creates conflicting
> per-arch packages.
> On the other hand, the build for all archs sometimes takes more than
> 40 GB/core, which will be difficult to run on autobuilders.
> Which one should we support: GPU_ARCHS or GPU_TARGETS?
> Are other packages affected, and how do you think it would be
> best to deal with this?
I'm afraid I don't have good answers for you. This may be a case where
we just try to put something that we think makes sense into the team
repo or into experimental, and rework it based on what we discover
trying to integrate it into other libraries.
CK is a key library, but I know very little about it aside from the fact
that it is not going to be easy to build. I also fear that different CK
reverse dependencies may be picky about what version of CK they require.
This is just something that we're going to have to learn as we start
trying to make use of it.
> Another question: would amd-clang improve memory issues?
> I noticed that Debian's clang has no support for parallel jobs:
> -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
> -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
> I've seen that the Ubuntu llvm-toolchain-rocm package [2] builds clang-rocm.
> Would composable kernel be a good test case for testing improvements?
We discussed this offline, but I would like to answer your question
on-list for posterity. You asked, "would amd-clang improve memory
issues?" The answer is no. The `-parallel-jobs=N` flag allows clang to
run N child processes in parallel when compiling a translation unit for
multiple GPU architectures rather than building the unit for each GPU
architecture in serial. This flag can be useful, but it actually
increases peak memory usage.
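As a rough illustration of why running the per-architecture compiles in
parallel raises peak memory, here is a minimal Python sketch. The model
and every number in it are assumptions made up for illustration (they
are not measurements of CK): it only multiplies the number of concurrent
build jobs by the number of device compilations running at once inside
each translation unit, and ignores host compilation and linking.

```python
def peak_memory_gib(make_jobs, arch_count, per_arch_gib, parallel_jobs=1):
    """Rough upper bound on peak compiler memory, in GiB.

    make_jobs     -- translation units built at once (e.g. make -jN)
    arch_count    -- number of GPU architectures being targeted
    per_arch_gib  -- assumed memory of one device compilation (hypothetical)
    parallel_jobs -- device compilations run at once per translation unit
    """
    # A translation unit cannot run more device compiles than it has archs.
    concurrent_per_unit = min(max(parallel_jobs, 1), arch_count)
    return make_jobs * concurrent_per_unit * per_arch_gib


# Hypothetical build: 4 build jobs, 6 architectures, 8 GiB per device compile.
serial = peak_memory_gib(make_jobs=4, arch_count=6, per_arch_gib=8)
parallel = peak_memory_gib(make_jobs=4, arch_count=6, per_arch_gib=8,
                           parallel_jobs=6)
print(f"serial per-arch compiles:   ~{serial} GiB peak")
print(f"parallel per-arch compiles: ~{parallel} GiB peak")
```

Under this simplified model the total work is unchanged, but more of it
is resident in memory at the same time, which is the opposite of what a
memory-constrained autobuilder needs.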
On a related note, I recently learned that there is an upstream
alternative to the parallel jobs flag. I'm not sure if LLVM 21 is new
enough, but you might be able to use `--offload-new-driver
--offload-jobs=N` to achieve a similar effect with upstream clang [3].
Sam Liu, an AMD LLVM developer, described it as such:
> About out-of-tree status of -parallel-jobs, currently there is an
> alternative option to it called --offload-jobs=N which is in trunk but
> only available for HIP under --new-offload-driver option. Currently
> --new-offload-driver is experimental but should work for most HIP apps.
> The plan is to gradually transition to this new driver since it
> eventually supports interoperability with OpenMP offloading.
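To make the option combination above concrete, here is a small Python
sketch that only assembles a compiler command line; it does not run
anything. The helper name, the source file name, and the architecture
list are hypothetical. The driver flag is written with the upstream
clang spelling, `--offload-new-driver`, and whether a given LLVM release
accepts `--offload-jobs=N` is something to verify locally rather than
assume.

```python
def hip_compile_args(source, archs, offload_jobs, compiler="clang++"):
    """Build an argument list for compiling one HIP source file.

    Hypothetical helper for illustration only; it does not check that the
    compiler actually supports these options.
    """
    if offload_jobs < 1:
        raise ValueError("offload_jobs must be at least 1")
    args = [compiler, "-x", "hip",
            "--offload-new-driver",            # experimental upstream driver
            f"--offload-jobs={offload_jobs}"]  # device compiles run at once
    # One --offload-arch option per GPU architecture to build for.
    args += [f"--offload-arch={arch}" for arch in archs]
    output = source.rsplit(".", 1)[0] + ".o"
    args += ["-c", source, "-o", output]
    return args


# Hypothetical file name and architecture list.
cmd = hip_compile_args("kernel.hip", ["gfx90a", "gfx1100"], offload_jobs=2)
print(" ".join(cmd))
```

Keeping `offload_jobs` small is one way to trade build time against the
peak memory concern discussed earlier in this message.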
Sincerely,
Cory Bloor
[1] https://github.com/ROCm/composable_kernel
[2] https://launchpad.net/~bullwinkle-team/+archive/ubuntu/rocm-devel
[3] https://gitlab.kitware.com/cmake/cmake/-/issues/26997