Hi Christian,
> On 2023-11-23 08:35, Cordell Bloor wrote:
>> On 2023-11-22 03:19, Christian Kastner wrote:
>>>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is the KConfig option for "Enable HMM-based shared virtual memory manager", which is required for xnack+ operation. The xnack feature allows some AMD GPUs to retry memory accesses that fail due to a page fault, which is used as a mechanism for automatically migrating managed memory from host to device. With xnack disabled, page faults in device code are not recoverable [1].
>>>
>>> This also required DEVICE_PRIVATE (and that one also suggests HMM_MIRROR). I don't see any downside to these; should we request them from the Kernel Team?
>>
>> I suppose the downside would be that more code means more bugs. I'm not sure what inclusion criteria is used by the maintainers, but it seems
>
> You linked to [1] in one of your replies. Under "Supported Hardware", the article states:
>
>> Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually support XNACK, but only APU platforms enabled it by default. On dedicated graphics cards, it's disabled by the Linux amdgpu kernel driver, possibly due to stability concerns as it's still an experimental feature. For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only computing cards from the CDNA series has XNACK support, such as Instinct MI100 and MI200 - and they also belong to the GFX900 series.
>
> I don't think the lack of official support is a problem here; evaluating this is what we have our CI for. We could build an image with a fixed kernel and see what happens to the tests there.
I think you've misunderstood this. AMD officially supports the Radeon Pro W6800 (gfx1030) when running on kernels with HSA_AMD_SVM enabled in the driver. They provide a single amdgpu-dkms package for all officially supported GPUs.
Enabling HSA_AMD_SVM will not cause xnack to be used on GFX10/GFX11 GPUs. In fact, it is not sufficient to cause xnack to be used for most GFX9 GPUs, either. To use xnack, you need to write your program using hipMallocManaged rather than hipMalloc, build your software with the xnack compiler feature enabled, and (unless you have MI200 hardware) add amdgpu.noretry=0 to your kernel's boot parameters.
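For concreteness, the steps above look roughly like the following sketch. The gfx90a target and the saxpy.hip filename are just illustrative examples; substitute your own GPU's target ID and sources:

```shell
# Build with the xnack+ target feature. Without the ":xnack+" suffix, the
# compiler emits code that does not assume recoverable GPU page faults.
hipcc --offload-arch=gfx90a:xnack+ -o saxpy saxpy.hip

# On pre-MI200 hardware, additionally allow retry on GPU page faults by
# adding this to the kernel boot parameters (e.g. via GRUB_CMDLINE_LINUX
# in /etc/default/grub, then update-grub and reboot):
#   amdgpu.noretry=0

# And in the source itself, the migratable allocations must come from
# hipMallocManaged() rather than hipMalloc().
```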
> However, unlikely as it may seem, I'd still like to ask: is there any risk of negatively affecting the graphics side of this? Can this change somehow break a regular user's video output? This is far-fetched, but it's not entirely inconceivable that some external stack might rely on the current behavior.
Yes, there is always a risk when enabling a new feature that it will introduce bugs. I see there's an issue on the amdgpu bug tracker with a user who has both an AMD GPU and an NVIDIA GPU on their system. It seems that HSA_AMD_SVM is causing issues with switching the NVIDIA card back and forth between the host driver and vfio-pci [2].
> As a workaround, I was hoping that setting HSA_XNACK=0 would disable the check, but it didn't work on my end, unfortunately.
The HSA_XNACK environment variable only affects hardware where xnack can be enabled and disabled on a per-process basis. Everything prior to MI200 could only choose whether xnack was enabled or disabled at boot time; it's an actual GPU hardware state. If your GPU supports xnack at all, the GPU state will be reported in rocminfo as gfxNNN:xnack- or gfxNNN:xnack+ (for xnack off and xnack on, respectively). If your GPU does not support xnack whatsoever, then the state won't be reported, but it will be equivalent to xnack-.
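As a quick illustration (assuming a working ROCm install), you can pull the reported state out of the rocminfo output with something like:

```shell
# List the GPU target IDs the runtime reports. A ":xnack-" or ":xnack+"
# suffix (e.g. gfx90a:xnack-) shows the current hardware state; a bare
# gfxNNN with no suffix means the GPU does not report xnack at all.
rocminfo | grep -o 'gfx[0-9a-f]*\(:xnack[+-]\)*' | sort -u
```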
Sincerely,
Cory Bloor
[1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/
[2]: https://gitlab.freedesktop.org/drm/amd/-/issues/2794