
Re: Enabling ROCm on Everything



On Tue, 2023-03-21 at 18:31 -0600, Cordell Bloor wrote:
> On 2023-03-21 12:41, Christian Kastner wrote:
> > One difficulty we will need to figure out one way or another is how to
> > actually bring the user to the right package. What do we do when the
> > user wants to `apt install pytorch-rocm`?
> Maybe it should be `apt install pytorch-rocm-gfx<N>`? The user already 
> needs to know their hardware to choose between pytorch-cuda, 
> pytorch-rocm and pytorch-oneapi. It is more burdensome to ask the user 
> to be more specific about their hardware than just specifying the 
> vendor, but that seems more like a matter of degree than a fundamental 
> difference.

Given the lack of backward compatibility in the compiled GPU code, I now
agree that pytorch-rocm-gfx<N> is the proper way to go. But I'd still
suggest that ROCm upstream reconsider backward compatibility, as the
abundance of GPU architectures already leads to obvious problems.
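
For example (just a sketch; the package name below is hypothetical and it
assumes the rocminfo tool is installed), the user could look up the ISA of
the installed GPU and then pick the matching package:

    $ rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
    gfx1030
    $ sudo apt install pytorch-rocm-gfx10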

> > Another difficulty we might need to consider is: what if the system in
> > question contains multiple GPU architectures (e.g. 6800 XT and 7900 XT)?
> 
> I think the sad truth is that it's not technically feasible for Debian 
> to handle every possible hardware configuration. The solution I propose 
> handles all single-GPU systems and many systems with a combination of 
> GPUs, but it wouldn't handle the specific case that you mentioned.
> 
> I suppose if the -gfx10 and -gfx11 packages installed to someplace like 
> /usr/lib/<host-target>/<device-target>/libfoo.so, then you could use 
> environment variables like LD_LIBRARY_PATH and ROCR_VISIBLE_DEVICES to 
> use the GPUs separately. You would not be able to have both devices 
> visible in the same process because the HIP runtime will throw an error 
> if you do not have kernels for all visible devices.
> 
> Users with more esoteric needs should probably be referred to a more 
> customizable package management tool. That sort of thing is a good use 
> case for Spack [1]. It builds packages from source and is thus much 
> slower than installing with apt, but it can handle much more complex 
> customization. `spack install <package> amdgpu_target==gfx1030,gfx1100` 
> will build the libraries you need for that configuration.

Do you mean the expected binary packages are libxxx-gfx9, libxxx-gfx10, and libxxx-gfx11?
That sounds like something in between my suggestion of a single fat binary and
your suggestion of a fine-grained split (like libxxx-gfx900, libxxx-gfx906, etc.).

According to
https://llvm.org/docs/AMDGPUUsage.html
it does not seem like squashing all gfx9XX code objects into one shared object
would lead to a super giant libxxx-gfx9 package.

This would be less burdensome to humans while remaining flexible enough.
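
To make that concrete (an untested sketch; the source and library names are
made up), one fat shared object covering the whole gfx9 family would just
mean passing several --offload-arch flags to hipcc:

    hipcc -shared -fPIC foo.hip -o libfoo.so \
        --offload-arch=gfx900 --offload-arch=gfx906 \
        --offload-arch=gfx908 --offload-arch=gfx90a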

As for the co-existence of -gfx9 and -gfx10 libraries... I'd rather avoid it,
because it is bound to confuse users: we could never make the packages work
out of the box, and users would be forced to learn LD_LIBRARY_PATH, which is
an obvious red flag.
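
Just to illustrate that burden (the paths and device index are only
illustrative, following the /usr/lib/<host-target>/<device-target>/ layout
sketched above), running something on the second GPU would look like:

    export ROCR_VISIBLE_DEVICES=1
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/gfx11:$LD_LIBRARY_PATH
    python3 train.py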

By making libxxx-gfx9 and libxxx-gfx10 conflict with each other and thus not
be co-installable, users make their (correct) choice at installation time.
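
Roughly something like this in debian/control (only a sketch of the
relationships; libxxx stands in for a real library package):

    Package: libxxx-gfx9
    Provides: libxxx
    Conflicts: libxxx-gfx10, libxxx-gfx11

    Package: libxxx-gfx10
    Provides: libxxx
    Conflicts: libxxx-gfx9, libxxx-gfx11

Dependent packages could then depend on the virtual libxxx, and apt would
refuse to install two variants at the same time.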

> On 2023-03-21 13:58, M. Zhou wrote:
> 
> In general, there is no compatibility between the GFX ISAs. If you were 
> to drop an ISA from the fat binary, it wouldn't mean reduced performance 
> on the hardware matching that ISA. It would mean completely dropping 
> support for that hardware. While CUDA compiles to PTX bytecode, HIP 
> compiles to machine code. There is no hardware abstraction layer to hide 
> the differences between processors.

In this case I prefer the pytorch-rocm-gfx9, pytorch-rocm-gfx10, pytorch-rocm-gfx11
granularity. There are not too many sub-architectures within each gfx series.

> > 
> > BTW, it will also result in very frequent entering to NEW queue, which
> > will drastically block the development process.
> 
> It would result in a trip to the new queue each time a new binary 
> package is added, which would occur whenever we add a package for a new 
> GFX major version. However, that could only occur after (1) a new 
> generation of hardware is released, and (2) a new major version of LLVM 
> is packaged.
> 
> If we look at the history of new architecture major versions, GFX9 was 
> introduced with Vega in 2017, GFX10 was introduced with RDNA1 in 2019, 
> and GFX11 was introduced with RDNA3 in 2022. I'm not sure what is the 
> 'normal' frequency for packages going through NEW, but every couple 
> years doesn't seem that bad.
> 
> Also, I think we'd introduce this sort of packaging change at the same 
> time as updating to ROCm 6.0. The ABI changes in that release will 
> necessitate a trip through the new queue anyway.

Fair enough.

> > One single fat binary looks to cause the smallest overhead to human.
> > I really don't care about the overhead to machines even if there will
> > be some performance loss. Whatever solution that induces the least
> > amount of burden to human is the best choice for long term
> > maintenance.
> As far as I know, a single fat shared object library is not technically 
> possible while supporting all architectures. A single binary package 
> with multiple shared libraries might be possible, but the total 
> installed size would be enormous.
> > I can provide some technical suggestions on the implementation of the
> > package split. But before that, I'd suggest we think twice about whether
> > it induces more cost to human, for instance:
> > 
> > 1. will this significantly increase my working hour for the next time of update?
> > 2. will another contributor be able to grasp the whole thing in short time?
> 
> This proposal would significantly increase the time required to update 
> the libraries. If nothing else, expanding the architecture support would 
> significantly increase the time required to build. Whether it would be 
> difficult for another contributor to grasp, I'm not sure.

The proposed -gfx9, -gfx10, -gfx11 granularity looks acceptable to me.
Finer granularity like -gfx900, -gfx906, -gfxXXX would require some
scripting work for automatic code generation, but... I don't object to it
as long as I'm not the one who has to write and maintain the script :-)
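
For the record, the kind of script I have in mind is tiny. An untested
sketch (the package name and ISA list are made up):

    #!/bin/sh
    # Append one debian/control stanza per GPU ISA.
    for arch in gfx900 gfx906 gfx908 gfx90a gfx1030 gfx1100; do
        printf 'Package: libxxx-%s\n' "$arch"
        printf 'Architecture: amd64\n'
        printf 'Depends: ${misc:Depends}, ${shlibs:Depends}\n'
        printf 'Description: libxxx built for %s\n\n' "$arch"
    done >> debian/control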

