
Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled



Hi Christian,

On 2023-11-22 03:19, Christian Kastner wrote:
>> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
>> the KConfig for "Enable HMM-based shared virtual memory manager", which
>> is required for xnack+ operation. The xnack feature allows some AMD GPUs
>> to retry memory accesses that fail due to a page fault, which is used as
>> a mechanism for migrating managed memory automatically from host to
>> device. With xnack disabled, page faults in device code are not
>> recoverable [1].
>
> I've rebuilt our kernel with this option enabled, and the message indeed
> went away. Great!

> This also required DEVICE_PRIVATE (and that one also suggests
> HMM_MIRROR). I don't see any downside to these; should we request them
> from the Kernel Team?

I suppose the downside would be that more code means more bugs. I'm not sure what inclusion criteria the maintainers use, but it seems like a reasonable request.
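
For anyone who wants to check whether a given kernel has the options discussed above, here is a minimal sketch in Python. It assumes the kernel configuration is available as text at /boot/config-<release> (as on Debian) or at /proc/config.gz, and that the options appear with the usual CONFIG_ prefix. The function names are hypothetical and only for illustration.

```python
import gzip
import platform
from pathlib import Path

# Options named in this thread, with the CONFIG_ prefix used in kernel config files.
REQUIRED = ("CONFIG_HSA_AMD_SVM", "CONFIG_DEVICE_PRIVATE", "CONFIG_HMM_MIRROR")


def parse_kconfig(text):
    """Return {option: value} for each 'CONFIG_X=value' line.

    Lines such as '# CONFIG_X is not set' are comments and are skipped,
    so unset options are simply absent from the result.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def missing_options(text, required=REQUIRED):
    """List the required options that are not enabled (neither 'y' nor 'm')."""
    options = parse_kconfig(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


def read_running_kernel_config():
    """Best-effort read of the running kernel's config; None if unavailable.

    Assumption: the config is at /boot/config-<release> or /proc/config.gz.
    """
    try:
        boot = Path("/boot") / f"config-{platform.release()}"
        if boot.exists():
            return boot.read_text()
        proc = Path("/proc/config.gz")
        if proc.exists():
            return gzip.decompress(proc.read_bytes()).decode()
    except OSError:
        pass
    return None


if __name__ == "__main__":
    text = read_running_kernel_config()
    if text is None:
        print("kernel config not found")
    else:
        missing = missing_options(text)
        print("missing:", ", ".join(missing) if missing else "none")
```

This only reports how the kernel was built. It does not tell you whether a particular GPU supports xnack or whether xnack is actually enabled at runtime.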

> That did remind me of another message I've seen in dmesg, repeated a
> few dozen times, when some (but not all) tests are run:
>
>     amdgpu: init_user_pages: Failed to get user pages: -1
>
> rocrand is a good example where these occur.
>
> Despite the failure, I did not observe any negative side effects, but
> the above change also did not solve this. Have you seen this message in
> dmesg as well?

Yes, it can be observed in the logs I captured [1]. I'm not sure what it means. I'll ask.

Sincerely,
Cory Bloor

[1]: https://lists.debian.org/debian-ai/2023/11/msg00043.html

