
Re: HIP ISA Compatibility Unlocked



Hi Gavin,

I've been reviewing some of your patches. Thank you very much for sharing them. They've been a great help in understanding the changes made to rocm-compilersupport and rocm-hipamd since ROCm 5.2.

On 2024-02-21 07:35, Gavin Zhao wrote:
> Recently, I was able to patch ROCm-CompilerSupport, rocBLAS, MIOpen, and rocFFT so that, combined with your patches to the HIP runtime, I can run Stable Diffusion (PyTorch) and llama.cpp on my RX6600M (gfx1032) without HSA_OVERRIDE_GFX_VERSION. Hopefully these efforts will help Debian and other distros get this compatibility support as well. My patches are uploaded to my personal GitHub fork GZGavinZhao/<component-name> with branch name solus-rocm-<version>. For example, the patches for rocBLAS 6.0 are in the solus-rocm-6.0.0 branch of GZGavinZhao/rocBLAS.

> Please note that these patches haven't been thoroughly tested and the ROCm 6.0 packages are not in Solus's repo yet, so please let me know if there are any issues. In addition, the patches may need to be adapted to ROCm 5.7, since ROCm 6.0 doesn't have the int gcnArch as a member of hipDeviceProp_t. Instead, it has the string gcnArchName, which complicates the arch coercion logic a bit. ROCm 6.0 also requires an additional patch to ROCm-CompilerSupport, so you may want to double-check whether 5.7 needs that too.
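To illustrate why the string form complicates the coercion, here is a minimal Python sketch of the idea. It is not taken from the actual patches: the table, the function name, and the assumption that gcnArchName looks like a base name followed by optional ':'-separated feature flags (for example "gfx1032:xnack-") are all illustrative. With the integer gcnArch, the same coercion could be a single lookup on a number; with the string, the base name has to be split off and the feature suffix preserved.

```python
# Hypothetical coercion table: device ISA name -> ISA name the libraries were
# built for. Only the gfx1032 -> gfx1030 pair comes from this thread; a real
# patch would need to list every pairing it considers compatible.
COERCED_ARCH = {
    "gfx1032": "gfx1030",
}


def coerce_arch_name(gcn_arch_name: str) -> str:
    """Return the arch name to use when looking up prebuilt code objects.

    Assumes the input is a base name optionally followed by ':'-separated
    feature flags. The flags are carried over unchanged.
    """
    # Split "gfx1032:xnack-" into ("gfx1032", ":", "xnack-").
    base, sep, features = gcn_arch_name.partition(":")
    # Unknown names fall through untouched.
    coerced = COERCED_ARCH.get(base, base)
    return coerced + sep + features


if __name__ == "__main__":
    for name in ("gfx1032", "gfx1032:xnack-", "gfx90a:sramecc+:xnack-"):
        print(name, "->", coerce_arch_name(name))
```

Keeping the lookup in one table means the coercion policy lives in a single place, which would matter if the same logic had to be repeated across several components.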

The unbundler function that you patched exists in rocm-compilersupport 5.7, but it is not used by rocm-hipamd in 5.7. At least, not by default. The HIP_USE_RUNTIME_UNBUNDLER setting in rocclr/utils/flags.hpp is what controls whether the HIP runtime uses its own runtime unbundler or the comgr-provided unbundler.

In ROCm 5.7, HIP_USE_RUNTIME_UNBUNDLER defaults to true, so HIP defaults to using its own unbundler (like in ROCm 5.2). In ROCm 6.0, HIP_USE_RUNTIME_UNBUNDLER defaults to false, so HIP defaults to using comgr.

My recommendation for Solus would be to build HIP in ROCm 6.0 with HIP_USE_RUNTIME_UNBUNDLER set to true, thereby avoiding the need to patch comgr. These patches will be rendered obsolete by the Generic ISAs in LLVM 18, so the fewer components patched, the better.

> Lastly, I have a working PR for sccache that can cache HIP compilations. Combined with a patch for HIPCC, it let me greatly speed up my workflow: by exporting the environment variable HIP_CLANG_LAUNCHER=sccache, I can rapidly edit, clean, and rebuild. I'm not too familiar with how Debian packaging works, but if you've also been bothered by the long compilation times, hopefully this can help.

AMD uses sccache and ccache extensively in its internal CI systems for the ROCm project. The real trick is in the cache invalidation. I'll try to see if I can share more information about this. I also think you should consider submitting that patch for hipcc upstream. It may or may not get accepted, but it's a good feature (and AMD used a similar patch internally for a while).

Looking further into the future, I believe that the greatest gains would come from caching amdgpu bitcode. You might need some support from clang to interpose the appropriate calls, but the device code optimization passes are where the compiler spends most of its time during most library builds. Most source code and compiler flag changes don't affect the bitcode generated for the device code, so a cache keyed on the bitcode has a much higher hit rate than one keyed on the source code. I built a working prototype of this back when the rocm clang compiler was calling the external `opt` and `llc` utilities (circa ROCm 3.8).
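As a rough illustration of the bitcode-keyed idea, here is a minimal Python sketch. It is not the prototype mentioned above. The class name, the in-memory store, and the `backend` callable (a stand-in for whatever runs the device optimization and code generation passes) are all hypothetical. The point is only that the key is derived from the bitcode bytes plus the backend flags, so edits that leave the device bitcode unchanged still hit the cache.

```python
import hashlib


class BitcodeCache:
    """Cache backend output keyed on bitcode content, not on source text."""

    def __init__(self, backend):
        # backend: callable (bitcode: bytes, flags: tuple of str) -> bytes.
        # Hypothetical stand-in for the expensive optimization/codegen step.
        self._backend = backend
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(bitcode, flags):
        digest = hashlib.sha256()
        # Length prefix keeps the bitcode and the flags unambiguous.
        digest.update(len(bitcode).to_bytes(8, "little"))
        digest.update(bitcode)
        for flag in flags:
            digest.update(flag.encode("utf-8"))
            digest.update(b"\0")
        return digest.hexdigest()

    def compile(self, bitcode, flags=()):
        flags = tuple(flags)
        key = self._key(bitcode, flags)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._backend(bitcode, flags)
        self._store[key] = result
        return result


if __name__ == "__main__":
    def slow_backend(bitcode, flags):
        # Placeholder for the real device-code passes.
        return b"object-for-" + hashlib.sha256(bitcode).digest()

    cache = BitcodeCache(slow_backend)
    cache.compile(b"device bitcode", ["-O3"])
    cache.compile(b"device bitcode", ["-O3"])  # same bitcode: served from cache
    print(cache.hits, cache.misses)
```

Only flags that affect the backend belong in the key. Host-only flags and source-level changes that produce identical bitcode should not invalidate the entry, which is where the higher hit rate would come from.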

> P.S. I'm not sure if the rocFFT patch is even needed, because rocFFT has an "any" fallback arch. Perhaps the patch would allow rocFFT to select the optimal algorithm instead of the generic fallback one, but I haven't benchmarked it.

If you search the source code for "gfx1030", there are only two places in all of rocFFT that could be affected. IMO, if those are useful optimizations that are being missed on other gfx103x architectures, you're better off benchmarking and submitting a proper patch upstream. The rocFFT library can nicely handle every GPU via runtime compilation, so I don't think there's a need for an ugly patch like the ones used in rocBLAS and MIOpen.

Sincerely,
Cory Blor

