Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled

To: Christian Kastner <ckk@debian.org>, 1056170@bugs.debian.org
Subject: Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Fri, 24 Nov 2023 08:42:25 -0700
Message-id: <2642d9d3-70a6-4d45-b18a-28b69bac5728@slerp.xyz>
Reply-to: Cordell Bloor <cgmb@slerp.xyz>, 1056170@bugs.debian.org
In-reply-to: <7f65707a-6025-4a8d-97df-38f27308d5ec@debian.org>
References: <170029315883.4509.11200496869648520679.reportbug@8546943794cb> <c4a525ba-b430-4098-b31d-6b102abfb98d@slerp.xyz> <358c2dbb-2fbf-4105-a1f3-39403f0286ae@debian.org> <170029315883.4509.11200496869648520679.reportbug@8546943794cb> <997a129b-64d8-4ca0-ac16-a062e950c8a4@slerp.xyz> <7f65707a-6025-4a8d-97df-38f27308d5ec@debian.org> <170029315883.4509.11200496869648520679.reportbug@8546943794cb>

Hi Christian,

On 2023-11-24 03:26, Christian Kastner wrote:

On 2023-11-23 08:35, Cordell Bloor wrote:

On 2023-11-22 03:19, Christian Kastner wrote:

The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
the KConfig for "Enable HMM-based shared virtual memory manager", which
is required for xnack+ operation. The xnack feature allows some AMD GPUs
to retry memory accesses that fail due to a page fault, which is used as
a mechanism for migrating managed memory automatically from host to
device. With xnack disabled, page faults in device code are not
recoverable [1].
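
For anyone who wants to check their own kernel, here is a minimal sketch of reading a Kconfig option out of kernel config text. The function name and the sample text are hypothetical; on Debian the config for the running kernel is normally found at /boot/config-$(uname -r).

```python
import re


def kconfig_option_state(config_text: str, option: str) -> str:
    """Return the value of CONFIG_<option> ('y', 'm', ...) or 'n' if unset.

    An option that is disabled appears either as a comment
    ("# CONFIG_FOO is not set") or not at all; both map to 'n' here.
    """
    pattern = re.compile(rf"^CONFIG_{re.escape(option)}=(\S+)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else "n"


# Hypothetical excerpt resembling a kernel built without HSA_AMD_SVM.
sample_config = """\
CONFIG_HSA_AMD=y
# CONFIG_HSA_AMD_SVM is not set
"""

print(kconfig_option_state(sample_config, "HSA_AMD"))      # y
print(kconfig_option_state(sample_config, "HSA_AMD_SVM"))  # n
```

To check a real system, pass the contents of the config file instead of the sample string.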

This also requires DEVICE_PRIVATE (and that one also suggests
HMM_MIRROR). I don't see any downside to these; should we request them
from the Kernel Team?

I suppose the downside would be that more code means more bugs. I'm not
sure what inclusion criteria are used by the maintainers, but it seems

you linked to [1] in one of your replies. Under "Supported Hardware",
the article states:

Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually support XNACK, but only APU platforms enabled it by default. On dedicated graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due to stability concerns as it’s still an experimental feature.

For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only computing cards from the CDNA series have XNACK support, such as Instinct MI100 and MI200 - and they also belong to the GFX900 series.

I don't think the lack of official support is a problem here; evaluating
this is what we have our CI for. We could build an image with a fixed
kernel and see what happens to the tests there.

I think you've misunderstood this. AMD officially supports the Radeon Pro W6800 (gfx1030) when running on kernels with HSA_AMD_SVM enabled in the drivers. They provide a single amdgpu-dkms package for all officially supported GPUs.

Enabling HSA_AMD_SVM will not cause xnack to be used on GFX10/GFX11 GPUs. In fact, it is not sufficient to cause xnack to be used for most GFX9 GPUs, either. To use xnack, you need to write your program using hipMallocManaged rather than hipMalloc, build your software with the xnack compiler feature enabled, and (unless you have MI200 hardware) add amdgpu.noretry=0 to your kernel's boot parameters.
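
The boot-parameter part of that list is the easiest to inspect from userspace. As a minimal sketch (the function name and sample command lines are hypothetical, and the "last occurrence wins" rule is a choice of this sketch rather than a statement about the kernel), one could look for amdgpu.noretry in the kernel command line:

```python
from typing import Optional


def amdgpu_noretry(cmdline: str) -> Optional[int]:
    """Return the integer value of amdgpu.noretry, or None if it is absent.

    If the parameter appears more than once, this sketch keeps the last
    well-formed value.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "amdgpu.noretry" and sep:
            try:
                value = int(raw)
            except ValueError:
                # Malformed value: skip this token and keep any earlier one.
                continue
    return value


# Hypothetical command lines for illustration.
with_param = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro amdgpu.noretry=0 quiet"
without_param = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet"

print(amdgpu_noretry(with_param))     # 0
print(amdgpu_noretry(without_param))  # None
```

On a running system the command line can be read from /proc/cmdline. A result of 0 only means the boot parameter was passed; the other conditions above (managed memory, an xnack-enabled build, suitable hardware) still apply.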

However, unlikely as it may seem, I'd still like to ask: is there any
risk of negatively affecting the graphics side of this? Can this change
somehow break a regular user's video output?

This is far-fetched, but it's not entirely inconceivable that some
external stack might rely on the current behavior.

Yes, there is always a risk when enabling a new feature that it will introduce bugs. I see there's an issue on the amdgpu bug tracker with a user who has both an AMD GPU and an NVIDIA GPU on their system. It seems that HSA_AMD_SVM is causing issues with switching the NVIDIA card back and forth between the host driver and vfio-pci [2].

As a workaround, I was hoping that setting HSA_XNACK=0 would disable the
check, but it didn't work on my end, unfortunately.

The HSA_XNACK environment variable only affects hardware where xnack can be enabled and disabled on a per-process basis. Everything prior to MI200 could only choose if xnack was enabled or disabled at boot time. It's an actual GPU hardware state. If your GPU supports xnack at all, the GPU state will be reported in rocminfo as gfxNNN:xnack- or gfxNNN:xnack+ (for xnack off and xnack on, respectively). If your GPU does not support xnack whatsoever, then the state won't be reported, but it will be equivalent to xnack-.
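
To make the three cases concrete, here is a minimal sketch that classifies a target string of the form described above. The function name and the sample strings are illustrative assumptions, not output copied from rocminfo.

```python
import re
from typing import Optional


def xnack_state(target: str) -> Optional[bool]:
    """Return True for xnack+, False for xnack-, None if no state is reported."""
    match = re.search(r":xnack([+-])", target)
    if match is None:
        return None
    return match.group(1) == "+"


def xnack_effectively_enabled(target: str) -> bool:
    """Treat an unreported state as equivalent to xnack-."""
    return xnack_state(target) is True


# Illustrative target strings.
for target in ("gfx906:xnack+", "gfx90a:sramecc+:xnack-", "gfx1030"):
    print(target, xnack_state(target), xnack_effectively_enabled(target))
```

The two-function split keeps "not reported" distinguishable from "reported as off" while still giving a single yes/no answer when that is all a caller needs.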

Sincerely, Cory Bloor

[1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/

[2]: https://gitlab.freedesktop.org/drm/amd/-/issues/2794
