Hi Brian,
> I'd like to keep Debian stable on my host, and pass the GPU into VMs as needed for data science projects.
I don't mean to be discouraging (because it would be great if you
did manage to get PCIe passthrough working), but I guarantee you
will find it _much_ easier to use docker or podman instead of a
VM. You would, however, want to use a newer driver and firmware
than is available on bookworm.
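If you go the container route, a rough sketch of the invocation might look like the following. It assumes the amdgpu driver is loaded on the host so that /dev/kfd and /dev/dri exist. The function name rocm_container_cmd and the default image are placeholders I picked for illustration, and the group that owns the device nodes may differ on your system.

```shell
#!/bin/bash
# Build a container command line that exposes the AMD GPU device nodes to a
# container, instead of handing the whole PCIe device to a VM.
# Assumptions: /dev/kfd and /dev/dri exist on the host (amdgpu loaded), and
# the image name below is only an example.

rocm_container_cmd() {
    local runtime="${1:-podman}"                        # "podman" or "docker"
    local image="${2:-docker.io/rocm/pytorch:latest}"   # example image, swap as needed
    printf '%s run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video %s\n' \
        "$runtime" "$image"
}

# Only print the command so it can be reviewed before running it by hand.
rocm_container_cmd podman
```

The function only prints the command line, so nothing is started until you copy it into a shell yourself.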
To add to your collection of links on VFIO with RDNA 3 GPUs, there is "7900XTX Passthrough PCI reset seems not working" on Reddit [1], which includes this bit of info that might be relevant to handling devices that have failed a health check on the CI:
For what it's worth, I've found a host suspend to RAM & resume to be a pretty reliable fix for PCI hardware issues, both for GPUs that won't rebind to the AMD driver on Linux and for USB controllers that refuse to properly reset. Less painful than a full reboot. If you're fast enough you won't even lose network connections.
and
I have same issue on a Gigabyte 6700XT. Found this on some forums I use in a script with sudo privileges:

    #!/bin/bash
    echo 1 > /sys/bus/pci/devices/0000:4e:00.0/remove
    echo 1 > /sys/bus/pci/devices/0000:4e:00.1/remove
    echo "Suspending..."
    rtcwake -m no -s 4
    systemctl suspend
    sleep 5s
    echo 1 > /sys/bus/pci/rescan
    echo "Reset done"
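For anyone wanting to try that on a different machine, here is a lightly parameterized sketch of the same remove / suspend / rescan sequence. The helper names (sysfs_write, reset_via_suspend) and the DRY_RUN switch are my own additions for illustration. It defaults to a dry run that only prints what it would do, and I have not tested the real path on RDNA 3 hardware.

```shell
#!/bin/bash
# Sketch of the remove / suspend / rescan workaround quoted above.
# DRY_RUN defaults to 1, so by default the steps are only printed.
# Set DRY_RUN=0 and run as root to actually perform them.

DRY_RUN="${DRY_RUN:-1}"

sysfs_write() {
    # Write value $1 to sysfs attribute $2, or print what would be written.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write $1 to $2"
    else
        echo "$1" > "$2"
    fi
}

reset_via_suspend() {
    # Arguments: the PCI address of every function on the card
    # (for example the GPU itself and its HDMI audio function).
    local dev
    for dev in "$@"; do
        sysfs_write 1 "/sys/bus/pci/devices/$dev/remove"
    done
    if [ "$DRY_RUN" = "1" ]; then
        echo "would arm an RTC wake alarm and suspend to RAM"
    else
        rtcwake -m no -s 4     # arm the wake alarm without suspending
        systemctl suspend      # suspend; the alarm wakes the host again
        sleep 5                # give the host a moment to finish resuming
    fi
    sysfs_write 1 /sys/bus/pci/rescan
}

# The addresses from the quoted script; substitute the ones lspci reports.
reset_via_suspend 0000:4e:00.0 0000:4e:00.1
```

Passing the addresses as arguments avoids hard-coding 0000:4e:00.x, which is specific to the original poster's system.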
There is also a very good description of the core problem in "The state of AMD RX 7000 Series VFIO Passthrough (April 2024)" on the Level1Techs forums [2]:
The Reset Issue
Yes, it’s back again, the RX 7000 series GPUs do not reset. So what? what are the implications, etc…
- If the GPU has crashed due to a fault and/or bug, or whatever, it can’t be brought back into a good state reliably. Sometimes it is still recoverable but not always. This is not such a huge deal for us home gamers, crashes of this nature are somewhat rare and we all know we should shutdown the guest OS when stopping the VM.
- If the GPU has been used in Windows, or Linux, the drivers upload firmwares to the GPU to operate on. The firmware images used are different and incompatible, the firmware for Linux does not work with the Windows drivers, etc. Without the ability to reset the GPU there is no way to unload/reset the GPU to accept a different firmware.
- Once the GPU has been posted once by either your motherboard BIOS or the VM's BIOS, allowing it to POST again will corrupt the GPU's state, usually requiring a cold boot of the host system to recover it.
There is significantly more detail in the post if you follow the link. I cannot vouch for the accuracy of the rest of the document, but the description of the reset issue appears to match what Alex Deucher and Alexandru Voicu described to me, at least to the best of my understanding, as this is not my area of expertise.
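If it helps with triage, one way to see which reset mechanisms the kernel advertises for a device is to read its reset_method attribute in sysfs. This is a minimal sketch, assuming a kernel new enough to expose that attribute (5.15 or later, as far as I know). The function name and the SYSFS_PCI override are mine, and an advertised method is no guarantee that the reset actually leaves the GPU in a usable state.

```shell
#!/bin/bash
# Report which reset mechanisms the kernel advertises for a PCI device.
# Assumption: the kernel exposes the "reset_method" sysfs attribute.
# SYSFS_PCI can be overridden, which is handy for trying this on a fake tree.

SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_reset_methods() {
    local dev="$1"
    local attr="$SYSFS_PCI/$dev/reset_method"
    if [ -r "$attr" ]; then
        echo "$dev: $(cat "$attr")"
    else
        echo "$dev: no reset_method attribute (no supported reset, or older kernel)"
    fi
}

# Default to the address from the Reddit script; pass your own as $1.
pci_reset_methods "${1:-0000:4e:00.0}"
```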
Sincerely,
Cory Bloor
[1]: https://www.reddit.com/r/VFIO/comments/zn0zdm/7900xtx_passthrough_pci_reset_seems_not_working/
[2]: https://forum.level1techs.com/t/the-state-of-amd-rx-7000-series-vfio-passthrough-april-2024/210242