
Re: RFS: rccl/5.4.3-3~exp1 -- ROCm Communication Collectives Library



Hi Christian,

On 2024-03-26 12:42, Christian Kastner wrote:
> I assume this is still up-to-date? (I added a d/gbp.conf.)

Yes.

> Slightly tangential: What do you think about setting up a specific
> worker configuration for multi-GPU tests, for example configuring
> pinwheel as
>    * amd64+gfx90a when one GPU is in use
>    * amd64+gfx90a_x2 (or similar) when both GPUs are in use?
>
> pinwheel/gfx90a is just one example, other configurations would of
> course also work.

I'm not sure. I thought amd64 and gfx90a were the ISAs, but x2 is more a question of system configuration. I may have misunderstood the meaning.
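To make sure we're talking about the same thing, here's a minimal HIP sketch of how I read it (untested; hipGetDeviceCount() and the gcnArchName field of hipDeviceProp_t are standard HIP API, the rest is just illustration). The device count describes the system configuration, while the ISA name is a property of each device:

    // The GPU count describes the system configuration; the ISA
    // name is reported per device.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess) {
            std::fprintf(stderr, "no HIP devices visible\n");
            return 1;
        }
        std::printf("system configuration: %d GPU(s)\n", count);
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
                // e.g. "gfx90a:sramecc+:xnack-" on an MI200-class device
                std::printf("device %d ISA: %s\n", i, prop.gcnArchName);
            }
        }
        return 0;
    }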

A related topic is that AMD is no longer following the 1:1 mapping between ISA and architecture that spawned the identical gfx103{0,1,2,3,4,5,6} ISAs. The recent Mendocino chips (Radeon 610M) report a gfxip of gfx1037 to the driver, but the clang developers chose to reuse the gfx1036 ISA rather than create yet another identical gfx103x ISA. Such reuse used to happen more often: the gfx803 ISA was shared by many different GPUs. For example, the MI6 and MI8 were Ellesmere and Fiji, respectively, but both were gfx803.

I don't really have any strong opinions about how the CI should handle some of these more complex hardware requirements. Your suggestion seems reasonable, although I'm not sure we want to add an amd64+gfx90a_x2 row to each package status page. I think we could get away with our current configuration for a while, if we want to spend more time thinking this through. Argo has four gfx803 GPUs in the container when it runs the autopkgtests, and it's currently working on the amd64+gfx803 queue.
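If we do eventually split queues by GPU count, the tests themselves could also guard against missing hardware. A rough sketch (assuming the test is declared with the skippable restriction, so that exit status 77 is reported as a skip rather than a failure):

    // Bail out of a multi-GPU test when the worker exposes fewer
    // GPUs than the test needs.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count < 2) {
            std::fprintf(stderr, "fewer than two GPUs visible, skipping\n");
            return 77; // autopkgtest treats 77 as "skip" under skippable
        }
        // ... the actual multi-GPU test would run here ...
        return 0;
    }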

Sincerely,
Cory Bloor

