
Re: Enabling ROCm on Everything



On Mon, 2023-03-20 at 23:17 -0600, Cordell Bloor wrote:
> Hello everyone,
> 
> In the last round of updates to the ROCm packages on Unstable, I did a 
> bunch of testing with an RX 5700 XT (gfx1010) and Radeon Pro v520 
> (gfx1011). I found that all Debian packaged libraries passed their full 
> test suites (with the exception of an out-of-memory error in one 
> rocprim/hipcub test). So, now the rocRAND, hipRAND, rocPRIM, hipCUB, 
> rocSPARSE and hipSPARSE packages are enabled for gfx803, gfx900, gfx906, 
> gfx908, gfx90a, gfx1010, gfx1011 and gfx1030.
> 
> However, there is a cost to this. The rocsparse library is ~250 MiB, but 
> we are now building it for eight different GPU instruction sets. That is 
> why the library binary is now 1.96 GiB. There are a total of twenty six 
> instruction sets in the GFX9, GFX10 and GFX11 families. If you add 
> gfx803, that makes twenty seven architectures. If we were to enable 
> support for all modern AMD GPUs [1], the total size of librocsparse.so would 
> be 0.25 GiB * 27 = 6.75 GiB [2]. For better or for worse, that does not 
> seem to actually be possible anyway. Once the size of the shared library 
> exceeds 2 GiB, it will become too large to use 32-bit relative offsets 
> and the library will fail to link.
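
Just to make that arithmetic concrete, here is a quick back-of-the-envelope
projection in python, using only the figures quoted above (the per-arch cost
is a crude average, so the numbers are rough):

  # librocsparse.so is 1.96 GiB when built for 8 instruction sets.
  GIB = 1024 ** 3
  per_arch = 1.96 * GIB / 8       # rough device-code cost per ISA
  link_limit = 2 ** 31            # 32-bit relative offsets cap the .so at 2 GiB

  for n_archs in (8, 14, 27):
      size = n_archs * per_arch
      status = "exceeds" if size > link_limit else "stays under"
      print(f"{n_archs:2d} archs: ~{size / GIB:.2f} GiB ({status} the 2 GiB limit)")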

There is exactly the same issue for pytorch-cuda. The binaries distributed by
upstream put all supported cuda architectures into a single fat binary, which
triggers a linker error (file too large). They have some workarounds, like
splitting the shared object into multiple smaller ones, but the overall
binary size keeps growing.

However, as long as the cuda compute architectures are backward-compatible,
we can build just a few selected architectures that will cover most cases.
For instance, upstream builds their binary release of pytorch-cuda
for the following cuda architectures:
  37, 50, 60, 61, 70, 75, 80, 86, 90
But I suppose 61, 75, and 86 would be sufficient for the debian build of
pytorch-cuda. These correspond to the GTX 1XXX, RTX 2XXX, and
RTX 3XXX series of GPUs. Users of datacenter GPUs are unlikely
to use the debian-packaged pytorch-cuda; in most cases they will
stick to anaconda. And even if a user does have a datacenter GPU, the
code still runs thanks to backward compatibility.
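
As a quick sanity check on the user side, something like the snippet below
can tell whether the local GPU has native code in a given build (a rough
sketch; the example arch list for the debian build is an assumption, and the
else branch just reports the mismatch rather than proving the JIT fallback
will work):

  import torch

  if torch.cuda.is_available():
      # Architectures the installed binary was compiled for,
      # e.g. ['sm_61', 'sm_75', 'sm_86'] for the debian build proposed above.
      compiled = torch.cuda.get_arch_list()
      # Compute capability of the GPU in this machine, e.g. (8, 6).
      major, minor = torch.cuda.get_device_capability(0)
      native = f"sm_{major}{minor}"
      if native in compiled:
          print(f"{native} has native code in this build")
      else:
          print(f"{native} not compiled in; would rely on compatibility with {compiled}")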

If the same kind of backward compatibility applies to the gfx architectures,
then I'd suggest building only a few selected archs by default.

> There are some improvements coming for this situation that are targeted 
> for LLVM 17. Of the twenty six instruction sets in GFX9, GFX10 and 
> GFX11, only maybe thirteen of them are distinct. There was a period of 
> several years in which each new GPU was given its own unique instruction 
> set id. Many of the instruction sets are identical to each other, and 
> they will be consolidated where possible. Incidentally, the fact that 
> some of these ISAs are identical is why the 
> HSA_OVERRIDE_GFX_VERSION environment variable can be used to safely 
> enable ROCm on some unsupported hardware. It is expected that the 
> upstream changes to consolidate ISAs will basically achieve the same 
> thing as the environment variable method, but without requiring user 
> intervention.
> 
> However, fourteen instruction sets are still too many to put all in a 
> single fat binary. As mentioned, rocsparse will fail to link if built 
> with more than eight (and even that is pushing dangerously close to the 
> limit). The fourteen instruction sets are gfx803, gfx900, gfx904, 
> gfx906, gfx908, gfx90a, gfx940, gfx1010, gfx1011, gfx1013, gfx1030, 
> gfx1100, gfx1101, and gfx1102. I don't think there's any reasonable way 
> for Debian to resolve this problem besides slicing the packages by 
> architecture.
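
On the HSA_OVERRIDE_GFX_VERSION point: for anyone who wants to try it, the
override just has to be in the environment before the HSA runtime comes up.
A minimal sketch in python, assuming an unsupported RDNA2 card being mapped
onto the gfx1030 ISA (the version string here is an example, adjust it for
the actual hardware):

  import os
  # Must be set before the ROCm runtime is loaded, hence before importing torch.
  os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"   # pretend to be gfx1030

  import torch
  if torch.cuda.is_available():
      print("ROCm device:", torch.cuda.get_device_name(0))
  else:
      print("no usable ROCm device; the override did not help")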

The following python code
  import torch; print(torch.__config__.show())
dumps the build configuration of the official pytorch binaries. It prints
the compiled cuda architectures, but it does not print the concrete rocm
architectures for their rocm build. In the upstream setup.py, the
comments suggest selecting the architectures with
  export PYTORCH_ROCM_ARCH="gfx900;gfx906"
I guess something like that would work.
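
If someone has a rocm build of pytorch at hand, it would also be worth
checking whether torch.cuda.get_arch_list() reports the gfx targets there.
I have not verified that myself, so treat the snippet below as a guess
rather than a known-good recipe:

  import torch

  # torch.version.hip is non-None on a rocm build and None on a cuda build.
  if torch.version.hip is not None:
      print("hip version:", torch.version.hip)
      # On cuda builds this lists sm_* strings; on rocm builds it should
      # (if my guess is right) list the compiled gfx targets instead.
      print("compiled archs:", torch.cuda.get_arch_list())
  else:
      print("cuda build; compiled archs:", torch.cuda.get_arch_list())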

> One possible split would be on the GFX architecture major version. There 
> would be binary packages for librocsparse0-gfx8, librocsparse0-gfx9, 
> librocsparse0-gfx10, and librocsparse0-gfx11 with each providing 
> librocsparse0. The GFX9 grouping would be pretty large with six 
> architectures, but that's still within acceptable limits. If need be, it 
> could be split into gfx9-gcn (gfx900, gfx904, gfx906) and gfx9-cdna 
> (gfx908, gfx90a, gfx940).

Theoretically this is a clean and elegant solution. But I suspect that
we don't have enough people to build and maintain such a sophisticated
dependency tree.

BTW, it would also mean passing through the NEW queue very frequently, which
would drastically slow down the development process.

A single fat binary looks like the option with the smallest overhead for
humans. I really don't care about the overhead for machines, even if there
is some performance loss. Whatever solution puts the least burden on humans
is the best choice for long-term maintenance.

> So, that's my proposal for enabling the ROCm libraries to run on all 
> modern AMD GPUs. I'm not sure how to structure a Debian package to do 
> this, but I hope that somebody finds the result to be an enticing enough 
> idea to provide some guidance. I imagine that we could build the library 
> multiple times, passing a different set of values for 
> -DAMDGPU_TARGETS to cmake during configuration. I know that splitting 
> the libraries by architecture is not a popular solution, but I don't see 
> any other option that enables broad hardware support. To me, the mere 
> existence of a feasible pathway to broad hardware enablement is exciting.
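
Mechanically, the repeated configure-and-build could be driven by a small
loop along these lines (purely illustrative: the build layout and the exact
arch grouping are assumptions of mine, and a real debian/rules would express
this differently):

  import subprocess

  # One build per proposed package slice, following the grouping above.
  ARCH_GROUPS = {
      "gfx8":  ["gfx803"],
      "gfx9":  ["gfx900", "gfx904", "gfx906", "gfx908", "gfx90a", "gfx940"],
      "gfx10": ["gfx1010", "gfx1011", "gfx1013", "gfx1030"],
      "gfx11": ["gfx1100", "gfx1101", "gfx1102"],
  }

  for name, targets in ARCH_GROUPS.items():
      build_dir = f"build-{name}"
      subprocess.run(["cmake", "-S", ".", "-B", build_dir,
                      f"-DAMDGPU_TARGETS={';'.join(targets)}"], check=True)
      subprocess.run(["cmake", "--build", build_dir], check=True)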

I can provide more technical suggestions on the implementation of the
package split. But before that, I'd suggest we think twice about whether
it adds more cost for humans, for instance:

1. will this significantly increase the working hours needed for the next round of updates?
2. will another contributor be able to grasp the whole scheme in a short time?

> Sincerely,
> Cory Bloor
> 
> [1]: Every AMD GPU from Polaris to RDNA3 and CDNA3.
> [2]: It's also worth noting that the rocSPARSE library is not 
> particularly large. The rocBLAS and rocFFT libraries are both larger 
> than rocSPARSE.
> 

