
Re: Enabling ROCm on Everything




On 2023-03-21 12:41, Christian Kastner wrote:
One difficulty we will need to figure out one way or another is how to
actually bring the user to the right package. What do we do when the
user wants to `apt install pytorch-rocm`?
Maybe it should be `apt install pytorch-rocm-gfx<N>`? The user already needs to know their hardware to choose between pytorch-cuda, pytorch-rocm and pytorch-oneapi. It is more burdensome to ask the user to be specific about their GPU architecture than just the vendor, but that seems like a matter of degree rather than a fundamental difference.
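As a sketch of how a user might discover the right suffix (this assumes the rocminfo utility from the ROCm runtime is already installed):

    # print the ISA name(s) of the installed GPUs, e.g. gfx1030
    rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u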
Another difficulty we might need to consider is: what if the system in
question contains multiple GPU architectures (e.g. 6800 XT and 7900 XT)?

I think the sad truth is that it's not technically feasible for Debian to handle every possible hardware configuration. The solution I propose handles all single-GPU systems and many systems with a combination of GPUs, but it wouldn't handle the specific case that you mentioned.

I suppose if the -gfx10 and -gfx11 packages installed to someplace like /usr/lib/<host-target>/<device-target>/libfoo.so, then you could use environment variables like LD_LIBRARY_PATH and ROCR_VISIBLE_DEVICES to use the GPUs separately. You would not be able to have both devices visible in the same process because the HIP runtime will throw an error if you do not have kernels for all visible devices.
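For example (a sketch only; the install paths below are hypothetical and just follow the layout suggested above):

    # one process per GPU, each seeing only its own device and loading
    # the libraries built for that device's ISA
    ROCR_VISIBLE_DEVICES=0 \
        LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/gfx1030 ./app  # 6800 XT
    ROCR_VISIBLE_DEVICES=1 \
        LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/gfx1100 ./app  # 7900 XT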

Users with more esoteric needs should probably be referred to a more customizable package management tool. That sort of thing is a good use case for Spack [1]. It builds packages from source and is thus much slower than installing with apt, but it can handle much more complex customization. `spack install <package> amdgpu_target==gfx1030,gfx1100` will build the libraries you need for that configuration.

On 2023-03-21 13:58, M. Zhou wrote:
There is exactly the same issue for pytorch-cuda. The upstream-distributed
binaries put all supported CUDA architectures into a single fat binary,
which causes linker errors (file too large). They have some workarounds
like splitting the shared object into multiple ones, but the overall
binary size is still growing.

However, as long as the CUDA compute architectures are backward-compatible,
we can just build several selected architectures that will work in most cases.
[...]
If the same backward compatibility applies to the gfx architectures,
then I'd suggest building only several selected archs by default.

In general, there is no compatibility between the GFX ISAs. If you were to drop an ISA from the fat binary, it wouldn't mean reduced performance on the hardware matching that ISA; it would mean completely dropping support for that hardware. While CUDA can embed PTX bytecode that the driver JIT-compiles for devices the binary has no machine code for, HIP compiles directly to machine code for each ISA. There is no hardware abstraction layer to hide the differences between processors.
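To illustrate (the source file name is arbitrary), every supported ISA must be requested explicitly when building with hipcc:

    # build a fat binary containing machine code for exactly two ISAs;
    # on any other GPU, the HIP runtime fails with hipErrorNoBinaryForGpu
    hipcc --offload-arch=gfx1030 --offload-arch=gfx1100 -o saxpy saxpy.hip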

One possible split would be on the GFX architecture major version. There
would be binary packages for librocsparse0-gfx8, librocsparse0-gfx9,
librocsparse0-gfx10, and librocsparse0-gfx11 with each providing
librocsparse0. The GFX9 grouping would be pretty large with six
architectures, but that's still within acceptable limits. If need be, it
could be split into gfx9-gcn (gfx900, gfx904, gfx906) and gfx9-cdna
(gfx908, gfx90a, gfx940).
Theoretically this is a clean and elegant solution. But I forecast that
we don't have enough people to work on and maintain such a sophisticated
dependency tree.

BTW, it will also result in very frequent trips to the NEW queue, which
will drastically slow down the development process.

It would result in a trip through the NEW queue each time a new binary package is added, which would occur whenever we add a package for a new GFX major version. However, that could only happen after (1) a new generation of hardware is released, and (2) a new major version of LLVM is packaged.

If we look at the history of new architecture major versions: GFX9 was introduced with Vega in 2017, GFX10 with RDNA1 in 2019, and GFX11 with RDNA3 in 2022. I'm not sure what the 'normal' frequency is for packages going through NEW, but every couple of years doesn't seem that bad.

Also, I think we'd introduce this sort of packaging change at the same time as updating to ROCm 6.0. The ABI changes in that release will necessitate a trip through the NEW queue anyway.
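For concreteness, here is a rough sketch of what one stanza of the split proposed above might look like in debian/control. The exact package relationships are my assumption (the usual mutual-exclusion idiom of Provides/Conflicts/Replaces on the common name) and would need review:

    Package: librocsparse0-gfx10
    Architecture: amd64
    Depends: ${shlibs:Depends}, ${misc:Depends}
    Provides: librocsparse0 (= ${binary:Version})
    Conflicts: librocsparse0
    Replaces: librocsparse0
    Description: RDNA-generation (gfx10) build of rocSPARSE

Each -gfx<N> variant would carry the same stanza with its own ISA group, making the variants interchangeable from the perspective of reverse dependencies.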

A single fat binary looks like it causes the smallest overhead for
humans. I really don't care about the overhead to machines, even if
there will be some performance loss. Whatever solution induces the
least burden on humans is the best choice for long-term
maintenance.
As far as I know, a single fat shared library supporting all architectures is not technically possible. A single binary package containing multiple shared libraries might be possible, but its total installed size would be enormous.
I can provide some technical suggestions on the implementation of the
package split. But before that, I'd suggest we think twice about whether
it induces more cost to humans, for instance:

1. will this significantly increase the working hours needed for the next update?
2. will another contributor be able to grasp the whole thing in a short time?

This proposal would significantly increase the time required to update the libraries. If nothing else, expanding the architecture support would make the builds take much longer. Whether it would be difficult for another contributor to grasp, I'm not sure.

On 2023-03-21 14:07, M. Zhou wrote:
OK. Although I think most users (including myself) will still use anaconda,
we can only see the popcon data after the upload.
As far as I know, there is no binary distribution offering AMD GPU hardware support as wide-ranging as what Debian could provide. I think it would be quite a draw if Debian provided these packages.

Sincerely,
Cory Bloor

[1]: https://spack.io/

