
Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled



Hi Christian,

On 2023-11-24 03:26, Christian Kastner wrote:
On 2023-11-23 08:35, Cordell Bloor wrote:
On 2023-11-22 03:19, Christian Kastner wrote:
The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
the KConfig for "Enable HMM-based shared virtual memory manager", which
is required for xnack+ operation. The xnack feature allows some AMD GPUs
to retry memory accesses that fail due to a page fault, which is used as
a mechanism for migrating managed memory automatically from host to
device. With xnack disabled, page faults in device code are not
recoverable [1].
This also requires DEVICE_PRIVATE (and that one also suggests
HMM_MIRROR). I don't see any downside to these; should we request them
from the Kernel Team?
I suppose the downside would be that more code means more bugs. I'm not
sure what inclusion criteria are used by the maintainers, but it seems
you linked to [1] in one of your replies. Under "Supported Hardware",
the article states:

Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually support XNACK, but only APU platforms enabled it by default. On dedicated graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due to stability concerns as it’s still an experimental feature.

For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only computing cards from the CDNA series has XNACK support, such as Instinct MI100 and MI200 - and they also belong to the GFX900 series.
I don't think the lack of official support is a problem here, evaluating
this is what we have our CI for. We could build an image with a fixed
kernel, and see what happens to tests there.

I think you've misunderstood this. AMD officially supports the Radeon Pro W6800 (gfx1030) when running on kernels with HSA_AMD_SVM enabled in the drivers. They provide a single amdgpu-dkms package for all officially supported GPUs.

Enabling HSA_AMD_SVM will not cause xnack to be used on GFX10/GFX11 GPUs. In fact, it is not sufficient to cause xnack to be used for most GFX9 GPUs, either. To use xnack, you need to write your program using hipMallocManaged rather than hipMalloc, build your software with the xnack compiler feature enabled, and (unless you have MI200 hardware) add amdgpu.noretry=0 to your kernel's boot parameters.
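
To illustrate the last of those conditions, here is a minimal Python sketch (not something from ROCm itself) that looks for amdgpu.noretry in a kernel command line string, such as the contents of /proc/cmdline. The function name amdgpu_noretry and the sample command line are made up for illustration. The sketch only reports what was requested at boot; it says nothing about whether the hardware or the application actually uses xnack.

```python
from typing import Optional


def amdgpu_noretry(cmdline: str) -> Optional[int]:
    """Return the amdgpu.noretry value found on a kernel command line.

    Returns None when the parameter is absent or not an integer.
    If the parameter appears more than once, the last occurrence is used
    (an assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "amdgpu.noretry" and sep:
            try:
                value = int(val)
            except ValueError:
                # Ignore malformed values instead of guessing.
                pass
    return value


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.noretry=0"
print(amdgpu_noretry(sample))
```

On a running system, the same function could be given the text read from /proc/cmdline.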

However, unlikely as it may seem, I'd still like to ask: is there any
risk of negatively affecting the graphics side of this? Can this change
somehow break a regular user's video output?

This is far-fetched, but it's not entirely inconceivable that some
external stack might rely on the current behavior.

Yes, there is always a risk when enabling a new feature that it will introduce bugs. I see there's an issue on the amdgpu bug tracker with a user who has both an AMD GPU and an NVIDIA GPU on their system. It seems that HSA_AMD_SVM is causing issues with switching the NVIDIA card back and forth between the host driver and vfio-pci [2].

As a workaround, I was hoping that setting HSA_XNACK=0 would disable the
check, but it didn't work on my end, unfortunately.

The HSA_XNACK environment variable only affects hardware where xnack can be enabled and disabled on a per-process basis. On everything prior to MI200, xnack could only be enabled or disabled at boot time. It's an actual GPU hardware state. If your GPU supports xnack at all, the GPU state will be reported in rocminfo as gfxNNN:xnack- or gfxNNN:xnack+ (for xnack off and xnack on, respectively). If your GPU does not support xnack whatsoever, then the state won't be reported, but it will be equivalent to xnack-.
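
As a rough sketch of how such a target name could be interpreted, the Python snippet below splits a gfxNNN-style string on ":" and looks for the xnack feature. The function name xnack_state and the sample strings are assumptions for illustration; the exact layout of rocminfo output may differ from what is shown here.

```python
def xnack_state(target: str) -> str:
    """Classify the xnack feature in a target name such as 'gfx906:xnack-'.

    Returns 'on' for xnack+, 'off' for xnack-, and 'unreported' when no
    xnack feature appears (which behaves like xnack off).
    """
    # Everything after the first ':' is a list of feature flags.
    features = target.strip().split(":")[1:]
    for feature in features:
        if feature == "xnack+":
            return "on"
        if feature == "xnack-":
            return "off"
    return "unreported"


# Illustrative target names, not captured from a real system.
for name in ("gfx90a:sramecc+:xnack-", "gfx906:xnack+", "gfx1030"):
    print(name, "->", xnack_state(name))
```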

Sincerely, Cory Bloor

[1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/
[2]: https://gitlab.freedesktop.org/drm/amd/-/issues/2794