gfx1100: first successes with pass-through
Hi all,
On 2024-08-12 07:40, Cordell Bloor wrote:
> We've been unable to use PCIe passthrough with the W7800, W7700 and
> W7500 that AMD provided for the CI. This is why only the W7800 is
> currently in active use. We could swap an RX 6800 XT from Trinity to
> Explorer and install the W7700 on Trinity to get one more architecture
> tested. However, that still leaves us short one architecture. We may
> also want to consider acquiring at least one Navi 3x GPU that works with
> passthrough.
>
> What models of Navi 3x GPUs can we get working with passthrough? Has
> anyone seen reports of the reference cards working with passthrough? Or,
> maybe we should look into allowing multiple podman workers on one host?
I vaguely remember that pass-through did not instantly crash the VM on
GPU use when I first experimented with gfx1100 in December, so I gave it
another try last weekend.
I tested kernels 6.3 to 6.10 and firmware versions 20230210 to 20240709,
hoping I could at least reproduce a non-fatal result with the older versions.
The good news: pass-through seems to work with kernels 6.3-6.6. I ran
hipsolver tests and they finished without a fuss.
Starting with 6.7, all attempts to use the GPU result in the error
message Brian already reported: "error: kvm run failed Bad address".
This happens regardless of the firmware version.
I can't see anything obvious in the 6.7 changelog [1], but there are
numerous memory management changes, and from what I found when searching
the web for this message, "Bad address" seems to be related to page
faulting in some way.
I have not investigated further yet and don't know when I'll have the
next chance to do so, but if anyone beats me to it, I assume that
bisecting 6.6 to 6.7 will reveal the root cause. Who knows, it might be
a small thing.
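
For anyone who wants to automate that bisection, here is a minimal sketch of
how a single test run could be classified for `git bisect run`. It only
relies on the exit-code convention of `git bisect run` (0 = good, 125 = skip,
other codes from 1 to 127 = bad) and on the error string quoted above. The
function name `classify` and its two inputs are hypothetical; building and
booting each kernel, starting the VM, and collecting the QEMU output are
assumed to happen in a wrapper that is not shown here.

```python
# Sketch only: maps the outcome of one pass-through test run to the exit
# codes that `git bisect run` understands. All names are illustrative.

KVM_ERROR = "kvm run failed Bad address"  # error message quoted in this thread

GOOD = 0    # this kernel behaves like 6.6
BAD = 1     # this kernel behaves like 6.7
SKIP = 125  # this commit could not be tested (e.g. it did not build or boot)


def classify(qemu_log: str, tests_completed: bool) -> int:
    """Return a `git bisect run` exit code for one test run.

    qemu_log:        captured QEMU output of the run (assumed input)
    tests_completed: whether the guest test suite ran to completion
                     (assumed to be reported by the wrapper)
    """
    if KVM_ERROR in qemu_log:
        return BAD
    if not tests_completed:
        # No KVM error, but no result either: do not blame this commit.
        return SKIP
    return GOOD
```

A wrapper script would end with `sys.exit(classify(log_text, tests_completed))`
and be started with `git bisect start v6.7 v6.6` followed by
`git bisect run <wrapper>`. Treating runs without a test result as "skip"
rather than "bad" is a deliberate choice here, so that unrelated build or
boot failures don't mislead the bisection.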
You can find my full test matrix (kernels, firmware) and notes here [2].
I have dmesgs from before and after tests (where applicable), but
haven't analyzed them yet. I'll eventually host them somewhere. I just
thought it might be best to share the intermediate results.
Best,
Christian
PS: Note that hipsolver was chosen because it's a simple and fast test
suite, so it's probably not the strongest indicator for overall
support.
[1]: https://kernelnewbies.org/Linux_6.7
[2]: https://salsa.debian.org/rocm-team/community/team-project/-/wikis/Navi-3x-QEMU-pass-through