
Re: Towards testing all Navi 3x architectures



Hi Cory,

On 2024-08-12 07:40, Cordell Bloor wrote:
> We've been unable to use PCIe passthrough with the W7800, W7700 and
> W7500 that AMD provided for the CI. This is why only the W7800 is
> currently in active use. We could swap an RX 6800 XT from Trinity to
> Explorer and install the W7700 on Trinity to get one more architecture
> tested.

That could work. The RX 6800 XT is a three-slot card though, so one x16
slot in Explorer would be blocked/wasted.

I haven't given up hope on pass-through yet, though.

> However, that still leaves us short one architecture. We may
> also want to consider acquiring at least one Navi 3x GPU that works with
> passthrough.
>
> What models of Navi 3x GPUs can we get working with passthrough? Has
> anyone seen reports of the reference cards working with passthrough?

I've seen a few reports of success with some cards (don't recall which
at the moment), so it should at least be possible.

I hope to re-run a test with the W7800 soon.

> Or, maybe we should look into allowing multiple podman workers on one
> host?

I have code ready for our podman+rocm to accept --gpu arguments
analogous to how qemu+rocm does it, though that code still assumes
device selection is managed through /dev/dri/* as per [1], which, as it
turns out, is not possible in rootless podman [2, 3].

I guess this could be switched to ROCR_VISIBLE_DEVICES but I haven't
investigated yet.
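
As a rough, untested sketch of what that switch might look like: pass
/dev/kfd and all of /dev/dri into the container, and narrow what the
ROCm runtime enumerates with ROCR_VISIBLE_DEVICES. The function name,
the image name and the way GPU indices are passed are assumptions made
for illustration, not how podman+rocm is actually written. Note that
this only filters visibility at the runtime level; it does not isolate
devices the way per-node /dev/dri/* selection would.

```python
import shlex


def podman_rocm_argv(image, gpu_indices, command):
    """Build a `podman run` argv for a ROCm test container.

    Hypothetical helper: all GPU device nodes are exposed, and the ROCm
    runtime is told which GPUs to enumerate via ROCR_VISIBLE_DEVICES.
    """
    if not gpu_indices:
        raise ValueError("at least one GPU index is required")
    # ROCR_VISIBLE_DEVICES takes a comma-separated list of device indices.
    visible = ",".join(str(i) for i in gpu_indices)
    return [
        "podman", "run", "--rm",
        "--device", "/dev/kfd",   # ROCm compute interface
        "--device", "/dev/dri",   # all render nodes, not a per-GPU subset
        "--env", f"ROCR_VISIBLE_DEVICES={visible}",
        image,
        *command,
    ]


if __name__ == "__main__":
    # "debian:unstable" is a placeholder image name.
    argv = podman_rocm_argv("debian:unstable", [1], ["rocminfo"])
    print(shlex.join(argv))
```

Whether two workers on the same host could safely run side by side this
way, each with a different index, is exactly the part that still needs
investigating.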

Nevertheless, I consider podman the ultima ratio as it is a slightly
flawed form of testing, seeing as how tests always run using the host
kernel and firmware, rather than the distro-specific one.

If we go further down this path, I'd like to adapt debci or gpuenv-utils
such that we have release-specific worker instances and queues, and
hosts reboot into the right kernel between releases.

Best,
Christian

[1]: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#restricting-gpu-access
[2]: https://github.com/containers/podman/issues/21454
[3]: https://github.com/ROCm/ROCm/issues/2860

