Hi Christian,
The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is the Kconfig option for "Enable HMM-based shared virtual memory manager", which is required for xnack+ operation. The xnack feature allows some AMD GPUs to retry memory accesses that fail due to a page fault, which is used as a mechanism for migrating managed memory automatically from host to device. With xnack disabled, page faults in device code are not recoverable [1].
I've rebuilt our kernel with this option enabled, and the message indeed went away. Great! This also required DEVICE_PRIVATE (and that one also suggests HMM_MIRROR). I don't see any downside to these; should we request them from the Kernel Team?
I suppose the downside would be that more code means more bugs. I'm not sure what inclusion criteria the maintainers use, but it seems like a reasonable request.
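For anyone who wants to check whether their own kernel was built with the options named above, here is a minimal sketch. It assumes the build configuration is installed at /boot/config-<release>, as Debian kernel packages normally do, and that the options appear with the usual CONFIG_ prefix. The function name check_options is made up for illustration.

```python
import os
import re

# Options named in this thread, written with the CONFIG_ prefix that
# Kconfig symbols carry in the generated config file.
REQUIRED = ("CONFIG_HSA_AMD_SVM", "CONFIG_DEVICE_PRIVATE", "CONFIG_HMM_MIRROR")


def check_options(config_text, options=REQUIRED):
    """Return {option: value}; value is e.g. 'y' or 'm', or None if not set."""
    found = {opt: None for opt in options}
    for line in config_text.splitlines():
        # Enabled options look like "CONFIG_FOO=y"; disabled ones appear as
        # the comment "# CONFIG_FOO is not set" and are skipped here.
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if m and m.group(1) in found:
            found[m.group(1)] = m.group(2)
    return found


if __name__ == "__main__":
    # Assumed location of the running kernel's config on Debian.
    path = "/boot/config-" + os.uname().release
    try:
        with open(path) as f:
            text = f.read()
    except OSError as e:
        print(f"could not read {path}: {e}")
    else:
        for opt, val in check_options(text).items():
            print(f"{opt}: {val or 'not set'}")
```

An option reported as "not set" here would be consistent with the xnack+ limitation described above, though this only inspects the build configuration and says nothing about runtime state.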
That did remind me of another message I've seen in dmesg, repeated a few dozen times, when some (but not all) tests are run:
    amdgpu: init_user_pages: Failed to get user pages: -1
rocrand is a good example where these occur. Despite the failure, I did not observe any negative side effects, but the above change did not resolve it either. Have you seen this message in dmesg as well?
Yes, it can be observed in the logs I captured [1]. I'm not sure what it means. I'll ask.
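To compare how often the message shows up between test runs, something like the following sketch could tally it from saved dmesg output. It assumes the log has been saved as text (for example by redirecting dmesg to a file) and matches only the message text quoted above. The function name count_init_user_pages_failures is hypothetical.

```python
import re
from collections import Counter

# Matches the message quoted in this thread and captures the error code.
PATTERN = re.compile(r"init_user_pages: Failed to get user pages: (-?\d+)")


def count_init_user_pages_failures(log_text):
    """Return a Counter mapping each error code (as int) to its occurrence count."""
    counts = Counter()
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts
```

Running it on the logs from before and after a kernel change would show whether the count moved, which may help when reporting the issue.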
Sincerely,
Cory Bloor
[1]: https://lists.debian.org/debian-ai/2023/11/msg00043.html