Hi Mo,
These are expected issues. In my opinion, they should both be fixed in PyTorch.
On 2024-10-11 02:22, Mo Zhou wrote:
1. rocm_version.h cannot be found in any rocm package.
rocm_version.h is provided by librocm-core. The librocm-core library provides two functions: getROCmVersion and getROCmInstallPath. In my opinion, these functions are conceptually flawed and should never be used.
There is no correct way to use getROCmVersion. As librocm-core has no functionality of its own, the only thing that version number can be used for is to infer the versions of other tools and libraries. However, every single library and component has its own version number that can be checked directly. That is what PyTorch should be doing. Using the librocm-core version to infer the versions of other libraries will cause nothing but problems. In any rolling release distribution, there will be times at which there is a mixture of package versions installed on the system. In fact, during transitions between ABIs, there may be multiple versions of a library installed on the system at the same time.
For example, there may be librocsparse.so.0 for programs that were built against rocSPARSE from ROCm 5.7 or earlier, and librocsparse.so.1 for programs that were built against rocSPARSE from ROCm 6.0 and later. Instead of using getROCmVersion to check which version of rocSPARSE the application was built against, users should be checking rocsparse_get_version. Similarly, when compiling against librocsparse-dev and librocfft-dev, PyTorch should check ROCSPARSE_VERSION_{MAJOR,MINOR,PATCH} for rocsparse functionality and ROCFFT_VERSION_{MAJOR,MINOR,PATCH} for rocfft functionality, instead of just checking ROCM_VERSION_{MAJOR,MINOR,PATCH} and assuming that every library version matches librocm-core-dev.
getROCmInstallPath works fine for Debian, so I won't go into it here. It causes problems for packaging systems like Spack, where there is no single directory for all libraries.
PyTorch asks librocm-core for information that it then uses to infer what functions are available in other ROCm libraries. It would be more robust to directly check the versions of the relevant libraries. The version checks should be patched and fixed upstream in PyTorch.
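To illustrate the per-library check, here is a rough C++ sketch and not a patch for PyTorch. It assumes the rocSPARSE version header is installed as rocsparse/rocsparse-version.h (the path has differed between ROCm releases) and that the integer reported by rocsparse_get_version is encoded as major*100000 + minor*100 + patch. The names LibVersion, decode_rocsparse_version, at_least and built_against are hypothetical helpers introduced only for this example.

```cpp
// Pull in the real rocSPARSE version macros when the header is available.
// Assumption: header path rocsparse/rocsparse-version.h.
#if defined(__has_include)
#  if __has_include(<rocsparse/rocsparse-version.h>)
#    include <rocsparse/rocsparse-version.h>
#  endif
#endif

// Placeholder values so the sketch still compiles without rocSPARSE installed.
#ifndef ROCSPARSE_VERSION_MAJOR
#  define ROCSPARSE_VERSION_MAJOR 0
#  define ROCSPARSE_VERSION_MINOR 0
#  define ROCSPARSE_VERSION_PATCH 0
#endif

// Hypothetical helper type for this example.
struct LibVersion {
    int major_num;
    int minor_num;
    int patch_num;
};

// Decode the integer reported at run time by rocsparse_get_version.
// Assumption: encoding is major*100000 + minor*100 + patch.
constexpr LibVersion decode_rocsparse_version(int v) {
    return LibVersion{v / 100000, (v / 100) % 1000, v % 100};
}

// True if `have` is at least version major.minor.
constexpr bool at_least(LibVersion have, int major, int minor) {
    return have.major_num > major ||
           (have.major_num == major && have.minor_num >= minor);
}

// Version of the rocSPARSE headers this translation unit was compiled against,
// taken from rocSPARSE itself rather than inferred from ROCM_VERSION_*.
constexpr LibVersion built_against{ROCSPARSE_VERSION_MAJOR,
                                   ROCSPARSE_VERSION_MINOR,
                                   ROCSPARSE_VERSION_PATCH};
```

A caller would pass the value obtained from rocsparse_get_version to decode_rocsparse_version and compare it with built_against, so the check still holds when librocm-core and librocsparse come from different ROCm releases.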
2. hipblaslt is missing while hipblas is present.
The hipblaslt library only supports CDNA 2, CDNA 3 and RDNA 3 GPUs. That is a subset of the AMD GPUs that PyTorch supports. The hipblaslt library could clearly be made optional, even if the PyTorch build system doesn't treat it as such.
While we should certainly finish packaging hipblaslt, I think we may also want to help the upstream PyTorch project make it an optional dependency. If nothing else, the library is ~10 GB in ROCm 6.3, and users with Vega and RDNA 2 GPUs may appreciate the disk space savings.
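As a rough idea of what optional could look like at run time, here is a C++ sketch. It is not how PyTorch selects its backend today. The gfx names are my own assumed mapping of CDNA 2, CDNA 3 and RDNA 3 to architecture names and should be checked against the hipblaslt sources. The functions base_arch, arch_may_use_hipblaslt and pick_gemm_backend are hypothetical.

```cpp
#include <string>

// Strip feature suffixes such as ":sramecc+:xnack-" from an architecture name.
inline std::string base_arch(const std::string& gcn_arch_name) {
    return gcn_arch_name.substr(0, gcn_arch_name.find(':'));
}

// Assumed mapping: CDNA 2 (gfx90a), CDNA 3 (gfx940/941/942),
// RDNA 3 (gfx1100/1101/1102). Illustrative only, not an authoritative list.
inline bool arch_may_use_hipblaslt(const std::string& gcn_arch_name) {
    static const char* const kArchs[] = {"gfx90a",  "gfx940",  "gfx941", "gfx942",
                                         "gfx1100", "gfx1101", "gfx1102"};
    const std::string arch = base_arch(gcn_arch_name);
    for (const char* candidate : kArchs) {
        if (arch == candidate) {
            return true;
        }
    }
    return false;
}

// Fall back to hipblas when hipblaslt was not built in or the GPU is not covered.
inline const char* pick_gemm_backend(bool built_with_hipblaslt,
                                     const std::string& gcn_arch_name) {
    return (built_with_hipblaslt && arch_may_use_hipblaslt(gcn_arch_name))
               ? "hipblaslt"
               : "hipblas";
}
```

With a fallback like this, a build without hipblaslt would still work on every GPU that hipblas supports, and a build with it would only use it where the hardware is covered.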
Sincerely, Cory Bloor