
Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled



Hi Cory,

On 2023-11-23 08:35, Cordell Bloor wrote:
> On 2023-11-22 03:19, Christian Kastner wrote:
>>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
>>> the KConfig for "Enable HMM-based shared virtual memory manager", which
>>> is required for xnack+ operation. The xnack feature allows some AMD GPUs
>>> to retry memory accesses that fail due to a page fault, which is used as
>>> a mechanism for migrating managed memory automatically from host to
>>> device. With xnack disabled, page faults in device code are not
>>> recoverable [1].
>> I've rebuilt our kernel with this option enabled, and the message indeed
>> went away. Great!
>>
>> This also required DEVICE_PRIVATE (and that one also suggests
>> HMM_MIRROR). I don't see any downside to these; should we request them
>> from the Kernel Team?
> 
> I suppose the downside would be that more code means more bugs. I'm not
> sure what inclusion criteria is used by the maintainers, but it seems

you linked to [1] in one of your replies. Under "Supported Hardware",
the article states:

> Not all GPUs are supported. Most GFX9 GPUs from the GCN series usually support XNACK, but only APU platforms enabled it by default. On dedicated graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due to stability concerns as it’s still an experimental feature.
> 
> For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only computing cards from the CDNA series has XNACK support, such as Instinct MI100 and MI200 - and they also belong to the GFX900 series.

I don't think the lack of official support is a problem here; evaluating
this is exactly what we have our CI for. We could build an image with a
kernel rebuilt with these options enabled, and see what happens to the
tests there.
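
Before running tests on such an image, it would be worth confirming that
the kernel really has the options from this thread enabled. A rough
sketch of such a check follows. The function names are made up for
illustration, the /boot/config-<release> location is an assumption about
how the image ships its kernel config, and treating HMM_MIRROR as
required is also an assumption (DEVICE_PRIVATE only suggests it).

```python
import os
import re

# Kconfig symbols discussed in this thread. HMM_MIRROR was only
# "suggested" by DEVICE_PRIVATE; listing it here is an assumption.
REQUIRED = ("CONFIG_HSA_AMD_SVM", "CONFIG_DEVICE_PRIVATE", "CONFIG_HMM_MIRROR")


def parse_kconfig(text):
    """Return {symbol: value} for every 'CONFIG_X=value' line."""
    options = {}
    for line in text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def missing_options(text, required=REQUIRED):
    """List the required symbols that are not set to 'y'.

    Lines such as '# CONFIG_X is not set' do not match the parser,
    so those symbols are reported as missing.
    """
    options = parse_kconfig(text)
    return [name for name in required if options.get(name) != "y"]


def check_running_kernel(path=None):
    """Check the running kernel's config file (hypothetical helper).

    Assumes the config is installed as /boot/config-<release>.
    """
    if path is None:
        path = "/boot/config-" + os.uname().release
    with open(path) as fh:
        return missing_options(fh.read())
```

Calling check_running_kernel() inside the image should return an empty
list if all three options are enabled, and the names of the missing
ones otherwise.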

However, unlikely as it may seem, I'd still like to ask: is there any
risk of negatively affecting the graphics side of this? Can this change
somehow break a regular user's video output?

This is far-fetched, but it's not entirely inconceivable that some
external stack might rely on the current behavior.

As a workaround, I was hoping that setting HSA_XNACK=0 would disable the
check, but it didn't work on my end, unfortunately.

Best,
Christian

> [1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/

