Re: Selecting appropriate GPU archs for testing (Was: ROCm 5.4.0 Released)



Hi Cory,

apologies for the very late reply.

On 2022-12-27 11:21, Cordell Bloor wrote:
> It's my understanding that you were looking into improving the Debian
> test infrastructure for GPU accelerators. As I'm currently in the
> process of preparing the hipcub and rocthrust packages, I wonder if you
> had any opinions on how the GPU architecture should be selected for
> their tests. To remind you of the context a little bit,
> 
> On 2022-12-12 03:01, Cordell Bloor wrote:
>> For header-only libraries like rocPRIM, rocThrust, and hipCUB, passing
>> -DAMDGPU_TARGETS only affects the architecture that the tests are
>> built for. It therefore only makes sense to build for hardware that
>> Debian tests against.

Unless I'm misunderstanding something, I'd suggest we build for every
architecture we'd like to support, but run the tests only on the
architecture(s) available.
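
Something like the following is what I have in mind for the test-runner
side. It's only a sketch: rocm_agent_enumerator (from rocminfo) is a
real tool, but the architecture list, the ./run_tests driver, and the
skip handling are placeholders I made up.

    #!/bin/sh
    # Architectures the package was built for (example list, not policy).
    BUILT_ARCHS="gfx906 gfx908 gfx90a gfx1030"

    # rocm_agent_enumerator prints one gfx target per detected agent;
    # gfx000 is a CPU placeholder and has to be filtered out.
    for arch in $(rocm_agent_enumerator | grep -v '^gfx000$'); do
        case " $BUILT_ARCHS " in
        *" $arch "*)
            exec ./run_tests    # hypothetical test driver
            ;;
        esac
    done

    echo "No supported GPU found; skipping tests." >&2
    exit 77    # 'skipped' by convention (automake; autopkgtest with
               # the 'skippable' restriction)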

How to do this will be part of what we need to figure out, because our
current infra and tooling don't yet allow us to express dependencies on
particular hardware (apart from ISA).

I guess we could start with a hard-coded architecture (whichever one
we currently use the most) and then figure out how best to parametrize
it along the way?
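
Concretely, the configure step could look something like this (again a
sketch; gfx90a and the ROCM_TEST_ARCH variable are placeholders I made
up, not an agreed-upon interface):

    # Hard-code one architecture for now, but read it from the
    # environment so it can be parametrized later without touching
    # the packaging again.
    ROCM_TEST_ARCH="${ROCM_TEST_ARCH:-gfx90a}"
    cmake -DAMDGPU_TARGETS="$ROCM_TEST_ARCH" ..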

> Do you have any thoughts on how the architectures should be selected?
> The package build time grows almost linearly with the number of
> architectures, so choosing carefully can significantly reduce the build
> resources required.

That's a good idea in general, but I wonder how much we are actually
resource-constrained.

I don't think we'd have a problem on an amd64, ppc64el, or even arm64
buildd or CI host.

However, I see that many packages are arch:any. Do we really intend to
support ROCm on armhf or mipsel? I don't think we'll ever see a user
there, but every FTBFS (failure to build from source) would be a
release-critical bug, and thus a maintenance burden.

> Sincerely,
> Cory Bloor
> 
> P.S. I noticed that GitLab uses ccache when building these GPU
> libraries. You need to be careful when you do that, because ccache has
> only partial support for HIP. It doesn't know to invalidate its cache
> upon changes to the compiler (since clang is used indirectly through the
> hipcc wrapper) or the device libraries.

That's good to know, thanks!

I don't think the official buildds use ccache, so we should be safe as
far as those are concerned.
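
For local builds that do use ccache, its compiler_check setting accepts
a command string, which I believe could be used to work around this. A
sketch, assuming a stock /opt/rocm layout (the bitcode path will differ
for the Debian packages):

    # Hash the real compiler's version output together with the device
    # libraries, instead of trusting the hipcc wrapper's mtime.
    # %compiler% is expanded by ccache itself.
    export CCACHE_COMPILERCHECK='%compiler% --version; cat /opt/rocm/amdgcn/bitcode/*.bc | sha1sum'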

Best,
Christian

