
Enabling ROCm on Everything



Hello everyone,

In the last round of updates to the ROCm packages on Unstable, I did a bunch of testing with an RX 5700 XT (gfx1010) and a Radeon Pro V520 (gfx1011). I found that all Debian-packaged libraries passed their full test suites (with the exception of an out-of-memory error in one rocPRIM/hipCUB test). So, the rocRAND, hipRAND, rocPRIM, hipCUB, rocSPARSE and hipSPARSE packages are now enabled for gfx803, gfx900, gfx906, gfx908, gfx90a, gfx1010, gfx1011 and gfx1030.

However, there is a cost to this. The rocsparse library is ~250 MiB per instruction set, and we are now building it for eight different GPU instruction sets. That is why the library binary is now 1.96 GiB. There are a total of twenty-six instruction sets in the GFX9, GFX10 and GFX11 families. If you add gfx803, that makes twenty-seven architectures. If we were to enable support for all modern AMD GPUs [1], the total size of librocsparse.so would be 0.25 GiB * 27 = 6.75 GiB [2]. For better or for worse, that does not seem to actually be possible anyway. Once the size of the shared library exceeds 2 GiB, it becomes too large to use 32-bit relative offsets and the library will fail to link.
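To make the arithmetic above concrete, here is a rough sketch of the size estimate. It assumes every instruction set costs the same ~250 MiB, which is only an approximation, and the function names are made up for illustration. Working in MiB rather than rounding to 0.25 GiB gives slightly smaller totals than the figure quoted above, but the conclusion is the same.

```python
# Back-of-the-envelope size estimate for a fat binary, using the figures
# quoted in this email. All numbers are approximations.
PER_ARCH_MIB = 250        # ~250 MiB of rocsparse code per instruction set
LINK_LIMIT_MIB = 2048     # 2 GiB: the reach of 32-bit relative offsets


def estimated_mib(num_archs: int) -> int:
    """Estimated library size in MiB when built for num_archs instruction sets."""
    return PER_ARCH_MIB * num_archs


def fits(num_archs: int) -> bool:
    """True if the estimated size stays below the 2 GiB limit."""
    return estimated_mib(num_archs) < LINK_LIMIT_MIB


for n in (8, 9, 14, 27):
    size_gib = estimated_mib(n) / 1024
    print(f"{n:2d} architectures: {size_gib:.2f} GiB, fits={fits(n)}")
```

Under these assumptions, eight architectures is the most that fits, which matches what I see in practice.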

There are some improvements coming for this situation that are targeted for LLVM 17. Of the twenty-six instruction sets in GFX9, GFX10 and GFX11, only about thirteen are distinct. There was a period of several years in which each new GPU was given its own unique instruction set id. Many of the instruction sets are identical to each other, and they will be consolidated where possible. Incidentally, the fact that some of these ISAs are identical is why the HSA_OVERRIDE_GFX_VERSION environment variable can be used to safely enable ROCm on some unsupported hardware. It is expected that the upstream changes to consolidate ISAs will basically achieve the same thing as the environment variable method, but without requiring user intervention.
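For anyone unfamiliar with the environment variable method, a minimal sketch follows. The value "10.3.0" (asking the runtime to treat the GPU as gfx1030) is an assumption chosen for illustration and does not come from my testing above. The override is only safe when the real GPU's ISA is actually identical to the one you claim, and the helper name is hypothetical.

```python
import os


def env_with_gfx_override(version: str) -> dict:
    """Return a copy of the current environment with the override set."""
    env = dict(os.environ)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


# Assumption: "10.3.0" makes the runtime treat the GPU as gfx1030.
# Only appropriate when the hardware's ISA really matches gfx1030.
env = env_with_gfx_override("10.3.0")

# The dict can then be passed to a child process, for example:
#   subprocess.run(["./my_rocm_program"], env=env)
# where ./my_rocm_program is a placeholder for any ROCm application.
print(env["HSA_OVERRIDE_GFX_VERSION"])
```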

However, fourteen instruction sets (the thirteen distinct ones plus gfx803) are still too many to put in a single fat binary. As mentioned, rocsparse will fail to link if built with more than eight (and even that is pushing dangerously close to the limit). The fourteen instruction sets are gfx803, gfx900, gfx904, gfx906, gfx908, gfx90a, gfx940, gfx1010, gfx1011, gfx1013, gfx1030, gfx1100, gfx1101, and gfx1102. I don't think there's any reasonable way for Debian to resolve this problem besides slicing the packages by architecture.

One possible split would be on the GFX architecture major version. There would be binary packages for librocsparse0-gfx8, librocsparse0-gfx9, librocsparse0-gfx10, and librocsparse0-gfx11, with each providing librocsparse0. The GFX9 grouping would be pretty large with six architectures, but that's still within acceptable limits. If need be, it could be split into gfx9-gcn (gfx900, gfx904, gfx906) and gfx9-cdna (gfx908, gfx90a, gfx940).
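The split by major version can be sketched as follows. This is only an illustration of the grouping, not packaging code. It assumes the major version is whatever digits precede the last two characters of the target name (so gfx90a is major 9 and gfx1010 is major 10), and the function names are invented for the example.

```python
from collections import defaultdict

# The fourteen instruction sets listed earlier in this email.
TARGETS = [
    "gfx803", "gfx900", "gfx904", "gfx906", "gfx908", "gfx90a", "gfx940",
    "gfx1010", "gfx1011", "gfx1013", "gfx1030",
    "gfx1100", "gfx1101", "gfx1102",
]


def gfx_major(target: str) -> int:
    """Major version: the digits between 'gfx' and the last two characters."""
    return int(target[len("gfx"):-2])


def split_by_major(targets):
    """Group targets into one binary package name per GFX major version."""
    groups = defaultdict(list)
    for target in targets:
        groups[f"librocsparse0-gfx{gfx_major(target)}"].append(target)
    return dict(groups)


packages = split_by_major(TARGETS)
for name, archs in packages.items():
    print(f"{name}: {', '.join(archs)}")
```

With this grouping no package holds more than six instruction sets, so each stays under the eight that rocsparse can currently link.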

So, that's my proposal for enabling the ROCm libraries to run on all modern AMD GPUs. I'm not sure how to structure a Debian package to do this, but I hope that somebody finds the result to be an enticing enough idea to provide some guidance. I imagine that we could build the library multiple times, passing a different set of values for -DAMDGPU_TARGETS to cmake during configuration. I know that splitting the libraries by architecture is not a popular solution, but I don't see any other option that enables broad hardware support. To me, the mere existence of a feasible pathway to broad hardware enablement is exciting.
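As a rough sketch of the "build it multiple times" idea, the snippet below generates one cmake configure command per group. The build directory names and the helper function are hypothetical, and a real Debian package would drive this from debian/rules rather than from Python. The only parts taken from the proposal are the grouping and the -DAMDGPU_TARGETS option, which I am assuming takes a semicolon-separated list like other CMake list variables.

```python
import shlex

# One entry per proposed binary package, keyed by GFX major version.
GROUPS = {
    "gfx8": ["gfx803"],
    "gfx9": ["gfx900", "gfx904", "gfx906", "gfx908", "gfx90a", "gfx940"],
    "gfx10": ["gfx1010", "gfx1011", "gfx1013", "gfx1030"],
    "gfx11": ["gfx1100", "gfx1101", "gfx1102"],
}


def cmake_command(group, targets, source_dir="."):
    """Configure command for one group, using a separate build directory."""
    return [
        "cmake",
        "-S", source_dir,
        "-B", f"build-{group}",  # hypothetical per-group build directory
        "-DAMDGPU_TARGETS=" + ";".join(targets),
    ]


commands = {group: cmake_command(group, targets) for group, targets in GROUPS.items()}
for cmd in commands.values():
    # shlex.join quotes the semicolons so the line is safe to paste into a shell.
    print(shlex.join(cmd))
```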

Sincerely,
Cory Bloor

[1]: Every AMD GPU from Polaris to RDNA3 and CDNA3.
[2]: It's also worth noting that the rocSPARSE library is not particularly large. The rocBLAS and rocFFT libraries are both larger than rocSPARSE.

