
Re: ROCm GPU_TARGETS and GPU_ARCHS and some other points



Hi Christian,

On 2025-10-16 07:10, Christian BAYLE wrote:
> I'm currently working on the composable-kernel [1] package and have some
> questions about GPU_TARGETS and GPU_ARCHS, which are used to build the
> libraries [...] I could do several builds on a per-arch basis, which has
> the good property of building tests and examples, but creates conflicting
> per-arch packages.
>
> On the other hand, the build for all architectures sometimes takes more
> than 40 GB per core, which will be difficult to run on the autobuilders.
>
> Which one should be supported: GPU_ARCHS or GPU_TARGETS?
>
> Are there other packages concerned? And how do you think it would be best
> to deal with this?

I'm afraid I don't have good answers for you. This may be a case where we just try to put something that we think makes sense into the team repo or into experimental, and rework it based on what we discover trying to integrate it into other libraries.

CK is a key library, but I know very little about it aside from the fact that it is not going to be easy to build. I also fear that different CK reverse dependencies may be picky about what version of CK they require. This is just something that we're going to have to learn as we start trying to make use of it.
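
For concreteness, I imagine the two approaches you describe look roughly
like the untested sketch below (GPU_TARGETS is the variable named in the
upstream CMake [1]; the compiler choice and the gfx list are only
placeholders):

    # Per-arch configure: small enough to also build tests and examples,
    # but it leads to conflicting per-arch packages.
    cmake -S composable_kernel -B build-gfx90a \
        -DCMAKE_CXX_COMPILER=hipcc \
        -DCMAKE_BUILD_TYPE=Release \
        -DGPU_TARGETS=gfx90a

    # Combined configure: one set of libraries covering every architecture,
    # at the cost of the memory use you measured.
    cmake -S composable_kernel -B build-all \
        -DCMAKE_CXX_COMPILER=hipcc \
        -DCMAKE_BUILD_TYPE=Release \
        -DGPU_TARGETS="gfx908;gfx90a;gfx942;gfx1100"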

> Another question: would amd-clang improve the memory issues?
> I noticed that Debian's clang has no support for parallel jobs:
>
> -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS
> -- Performing Test HIP_CLANG_SUPPORTS_PARALLEL_JOBS - Failed
>
> I've seen that the Ubuntu llvm-toolchain-rocm package [2] builds clang-rocm.
> Would composable-kernel be a good test case to test improvements?

We discussed this offline, but I would like to answer your question on-list for posterity. You asked, "would amd-clang improve memory issues?" The answer is no. The `-parallel-jobs=N` flag allows clang to run N child processes in parallel when compiling a translation unit for multiple GPU architectures rather than building the unit for each GPU architecture in serial. This flag can be useful, but it actually increases peak memory usage.
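
To make that concrete, with AMD's compiler the flag sits on an ordinary
multi-architecture HIP compile line, something like the untested sketch
below (the architecture list is arbitrary, and amdclang++ stands in for
whatever binary the ROCm toolchain actually installs):

    # One translation unit, three GPU architectures. -parallel-jobs=3 lets
    # the three device-side compilations run concurrently instead of
    # serially, which also multiplies the peak memory needed for this unit.
    amdclang++ -x hip -c foo.hip -o foo.o \
        --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx942 \
        -parallel-jobs=3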

On a related note, I recently learned that there is an upstream alternative to the parallel jobs flag. I'm not sure if LLVM 21 is new enough, but you might be able to use `--offload-new-driver --offload-jobs=N` to achieve a similar effect with upstream clang [3]. Sam Liu, an AMD LLVM developer, described it as follows:

> About out-of-tree status of -parallel-jobs, currently there is an alternative option to it called --offload-jobs=N which is in trunk but only available for HIP under --new-offload-driver option. Currently --new-offload-driver is experimental but should work for most HIP apps. The plan is to gradually transition to this new driver since it eventually supports interoperability with OpenMP offloading.
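
If you want to experiment, I would expect the upstream invocation to look
roughly like this (untested; note that the upstream driver spells the
option --offload-new-driver, and whether the clang in the archive is new
enough to have --offload-jobs at all is exactly the open question):

    # Same multi-arch compile as above, but using the upstream offloading
    # driver and allowing up to four device compile jobs in parallel.
    clang++ -x hip -c foo.hip -o foo.o \
        --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx942 \
        --offload-new-driver --offload-jobs=4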

Sincerely,
Cory Bloor

[1] https://github.com/ROCm/composable_kernel
[2] https://launchpad.net/~bullwinkle-team/+archive/ubuntu/rocm-devel
[3] https://gitlab.kitware.com/cmake/cmake/-/issues/26997
