Enabling ROCm on Everything
Hello everyone,
In the last round of updates to the ROCm packages on Unstable, I did a
bunch of testing with an RX 5700 XT (gfx1010) and a Radeon Pro V520
(gfx1011). I found that all Debian-packaged libraries passed their full
test suites (with the exception of an out-of-memory error in one
rocprim/hipcub test). So, now the rocRAND, hipRAND, rocPRIM, hipCUB,
rocSPARSE and hipSPARSE packages are enabled for gfx803, gfx900, gfx906,
gfx908, gfx90a, gfx1010, gfx1011 and gfx1030.
However, there is a cost to this. The rocsparse library is ~250 MiB per
instruction set, and we are now building it for eight different GPU
instruction sets. That is why the library binary is now 1.96 GiB. There
are a total of twenty-six instruction sets in the GFX9, GFX10 and GFX11
families. If you add gfx803, that makes twenty-seven architectures. If we
were to enable support for all modern AMD GPUs [1], the total size of
librocsparse.so would be 0.25 GiB * 27 = 6.75 GiB [2]. For better or for
worse, that does not seem to be possible anyway. Once the size of the
shared library exceeds 2 GiB, it becomes too large to use 32-bit relative
offsets and the library will fail to link.
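The arithmetic above can be sketched in a few lines of Python. This is
only a rough model: it assumes the library size scales linearly with the
number of instruction sets, and it derives the per-architecture size from
the 1.96 GiB / 8 figure rather than the rounded 0.25 GiB, so the
27-architecture estimate comes out slightly below 6.75 GiB. The function
names are made up for illustration.

```python
# Rough size model; linear scaling per instruction set is an assumption.
OBSERVED_SIZE_GIB = 1.96      # librocsparse.so built for eight targets
OBSERVED_TARGET_COUNT = 8
LINK_LIMIT_GIB = 2.0          # 32-bit relative offsets stop working past this

PER_TARGET_GIB = OBSERVED_SIZE_GIB / OBSERVED_TARGET_COUNT


def estimated_size_gib(target_count: int) -> float:
    """Estimated library size when built for `target_count` instruction sets."""
    return PER_TARGET_GIB * target_count


def max_targets_under_limit() -> int:
    """Largest number of instruction sets that stays within the link limit."""
    return int(LINK_LIMIT_GIB // PER_TARGET_GIB)


if __name__ == "__main__":
    for count in (8, 14, 27):
        size = estimated_size_gib(count)
        verdict = "fits" if size <= LINK_LIMIT_GIB else "too large to link"
        print(f"{count:2d} targets: {size:.2f} GiB ({verdict})")
    print("max targets in one binary:", max_targets_under_limit())
```

Under these assumptions, eight targets is the most that fits in a single
binary, which matches what we are seeing in practice.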
There are some improvements coming for this situation that are targeted
for LLVM 17. Of the twenty-six instruction sets in GFX9, GFX10 and
GFX11, only maybe thirteen of them are distinct. There was a period of
several years in which each new GPU was given its own unique instruction
set id. Many of the instruction sets are identical to each other, and
they will be consolidated where possible. Incidentally, the fact that
some of these ISAs are identical is why the HSA_OVERRIDE_GFX_VERSION
environment variable can be used to safely enable ROCm on some
unsupported hardware. It is expected that the
upstream changes to consolidate ISAs will basically achieve the same
thing as the environment variable method, but without requiring user
intervention.
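For anyone who has not used the environment variable method, here is a
minimal Python sketch of it. It assumes the variable takes a
"major.minor.stepping" string (e.g. 10.3.0 to present a GPU as gfx1030),
which is how it is commonly used, and that the last two characters of a
gfx name are hexadecimal minor and stepping digits. The helper names are
hypothetical, and which override is actually safe for a given GPU is not
something this sketch can tell you.

```python
import os
import subprocess
import sys


def gfx_to_override_version(gfx_name: str) -> str:
    """Convert a name like 'gfx1030' to an override string like '10.3.0'.

    Assumes the last two characters are hex minor/stepping digits and
    everything between 'gfx' and those digits is the decimal major version.
    """
    if not gfx_name.startswith("gfx") or len(gfx_name) < 6:
        raise ValueError(f"unexpected target name: {gfx_name!r}")
    digits = gfx_name[3:]
    major = int(digits[:-2])
    minor = int(digits[-2], 16)
    stepping = int(digits[-1], 16)
    return f"{major}.{minor}.{stepping}"


def run_with_override(command, gfx_name):
    """Run `command` with HSA_OVERRIDE_GFX_VERSION set for `gfx_name`."""
    env = dict(os.environ)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_to_override_version(gfx_name)
    return subprocess.run(command, env=env, check=False)


if __name__ == "__main__":
    # Show the value a child process would see; swap in a real ROCm program.
    child = "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])"
    run_with_override([sys.executable, "-c", child], "gfx1030")
```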
However, fourteen instruction sets (the thirteen distinct ones plus
gfx803) are still too many to put in a single fat binary. As mentioned,
rocsparse will fail to link if built
with more than eight (and even that is pushing dangerously close to the
limit). The fourteen instruction sets are gfx803, gfx900, gfx904,
gfx906, gfx908, gfx90a, gfx940, gfx1010, gfx1011, gfx1013, gfx1030,
gfx1100, gfx1101, and gfx1102. I don't think there's any reasonable way
for Debian to resolve this problem besides slicing the packages by
architecture.
One possible split would be on the GFX architecture major version. There
would be binary packages for librocsparse0-gfx8, librocsparse0-gfx9,
librocsparse0-gfx10, and librocsparse0-gfx11 with each providing
librocsparse0. The GFX9 grouping would be pretty large with six
architectures, but that's still within acceptable limits. If need be, it
could be split into gfx9-gcn (gfx900, gfx904, gfx906) and gfx9-cdna
(gfx908, gfx90a, gfx940).
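To make the split concrete, here is a small Python sketch that groups
the fourteen instruction sets listed above by GFX major version. It
assumes the major version is whatever sits between "gfx" and the last
two characters of the name, and the function names are only
illustrative.

```python
from collections import defaultdict

# The fourteen instruction sets listed above.
TARGETS = [
    "gfx803", "gfx900", "gfx904", "gfx906", "gfx908", "gfx90a", "gfx940",
    "gfx1010", "gfx1011", "gfx1013", "gfx1030",
    "gfx1100", "gfx1101", "gfx1102",
]


def major_version(target: str) -> str:
    """'gfx90a' -> '9', 'gfx1030' -> '10' (assumed naming scheme)."""
    return target[len("gfx"):-2]


def split_by_major(targets, library="librocsparse0"):
    """Map a binary package name to the targets it would contain."""
    groups = defaultdict(list)
    for target in targets:
        groups[f"{library}-gfx{major_version(target)}"].append(target)
    return dict(groups)


if __name__ == "__main__":
    for package, members in split_by_major(TARGETS).items():
        print(f"{package}: {', '.join(members)}")
```

With this grouping the largest package holds six instruction sets, which
is under the eight that currently fit in one binary.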
So, that's my proposal for enabling the ROCm libraries to run on all
modern AMD GPUs. I'm not sure how to structure a Debian package to do
this, but I hope that somebody finds the result to be an enticing enough
idea to provide some guidance. I imagine that we could build the library
multiple times, passing a different set of values for
-DAMDGPU_TARGETS to cmake during configuration. I know that splitting
the libraries by architecture is not a popular solution, but I don't see
any other option that enables broad hardware support. To me, the mere
existence of a feasible pathway to broad hardware enablement is exciting.
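As a rough illustration of the "build it multiple times" idea, the
Python sketch below prints one cmake configure command per proposed
package. It assumes -DAMDGPU_TARGETS takes a semicolon-separated list,
and the build directory names and helper function are made up. How this
would actually be wired into debian/rules is exactly the part I am
unsure about.

```python
import shlex

# Proposed packages and their instruction sets (from the split by major version).
PACKAGE_TARGETS = {
    "librocsparse0-gfx8": ["gfx803"],
    "librocsparse0-gfx9": ["gfx900", "gfx904", "gfx906",
                           "gfx908", "gfx90a", "gfx940"],
    "librocsparse0-gfx10": ["gfx1010", "gfx1011", "gfx1013", "gfx1030"],
    "librocsparse0-gfx11": ["gfx1100", "gfx1101", "gfx1102"],
}


def cmake_configure_command(package, targets, source_dir="."):
    """Build the argument list for one configure step.

    The build directory name is hypothetical; one directory per package
    keeps the builds from overwriting each other.
    """
    return [
        "cmake",
        "-S", source_dir,
        "-B", f"build-{package}",
        "-DAMDGPU_TARGETS=" + ";".join(targets),
    ]


if __name__ == "__main__":
    for package, targets in PACKAGE_TARGETS.items():
        # shlex.join quotes the semicolons so the line is safe to paste into a shell.
        print(shlex.join(cmake_configure_command(package, targets)))
```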
Sincerely,
Cory Bloor
[1]: Every AMD GPU from Polaris to RDNA3 and CDNA3.
[2]: It's also worth noting that the rocSPARSE library is not
particularly large. The rocBLAS and rocFFT libraries are both larger
than rocSPARSE.