
Re: RFS: rccl/5.4.3-3~exp1 -- ROCm Communication Collectives Library



OK, it's building on my end for a final check, but that might take until
tomorrow (the buildd logs say it's a 2h+ build).

On 2024-03-26 20:34, Cordell Bloor wrote:
> I'm not sure. I thought amd64 and gfx90a were the ISAs, but x2 is more a
> question of system configuration. I may have misunderstood the meaning.

No, your understanding is correct. I got a bit carried away thinking that
we could use the +modifier any way we want, including for system
configuration. But that's not what we originally aimed for, and it's
probably not wise to change the meaning now.

> A related topic is that AMD is no longer following the 1:1 mapping
> between ISA and architecture that spawned the identical
> gfx103{0,1,2,3,4,5,6} ISAs. The recent Mendocino chips (Radeon 610M)
> report themselves to the driver as gfx1037 for the gfxip, but clang
> developers chose to reuse the gfx1036 ISA rather than creating yet
> another identical gfx103x ISA. That used to happen more often. The
> gfx803 ISA was used by many different GPUs. For example, the MI6 and MI8
> were Ellesmere and Fiji, respectively, but both were gfx803.

What would the effect be for us? (But this is something that can be
discussed at another time.)

> I don't really have any strong opinions about how the CI should handle
> some of these more complex hardware requirements. Your suggestion seems
> reasonable, although I'm not sure we want to add an amd64+gfx90a_x2 row
> to each package status page. I think we could get away with our current
> configuration for a while, if we want to spend more time thinking this
> through. Argo currently has four gfx803 GPUs in the container when it
> runs the autopkgtests and it's currently working on the amd64+gfx803 queue.

I had two things in mind:

First, for any library, it would be nice to have multi-GPU test results,
as I suspect there are failure modes that we cannot recognize on a
single-GPU system -- if only in amdgpu, for example.

Second, it would be nice to have these results distinct from other
results. Otherwise they'd be commingled in one arch, which could be
confusing when multi-GPU issues do exist.

I defer to your experience: if you say that upstream, multi-GPU is
almost never a problem, then pursuing this is probably not worthwhile,
short-term.

Best,
Christian

