
Re: Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled



Hey Cory,

On 2023-11-21 21:01, Cordell Bloor wrote:
> On 2023-11-18 00:39, Cordell Bloor wrote:
>> Each time a HIP application is executed, the rocr-runtime prints the message:
>>
>>     KFD does not support xnack mode query.
>>     ROCr must assume xnack is disabled.
>>
>> It is unclear to me whether something is actually wrong or not. This
>> message is emitted from a debug_print statement in amd_topology.cpp. An
>> example of this message can be found in the CI logs [1].
> 
> This is a debug message. It is guarded by NDEBUG, so it would not be
> printed if rocr were built in Release mode. There is a bit of discussion
> upstream as to whether the debug_print should instead be guarded by an
> environment variable rather than a preprocessor definition.

> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
> the KConfig for "Enable HMM-based shared virtual memory manager", which
> is required for xnack+ operation. The xnack feature allows some AMD GPUs
> to retry memory accesses that fail due to a page fault, which is used as
> a mechanism for migrating managed memory automatically from host to
> device. With xnack disabled, page faults in device code are not
> recoverable [1].

I've rebuilt our kernel with this option enabled, and the message indeed
went away. Great!

Enabling HSA_AMD_SVM also required DEVICE_PRIVATE (which in turn suggests
HMM_MIRROR). I don't see any downside to enabling these; should we ask
the Kernel Team to turn them on?
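
For anyone who wants to check their own kernel, here is a minimal Python
sketch that reports the state of the three options named above. It assumes
the Debian convention of shipping the build configuration as
/boot/config-<kernel release>; the function names are hypothetical and only
for illustration.

```python
import platform
from pathlib import Path

# Kconfig options mentioned in this thread; kernel config files
# prefix each of them with "CONFIG_".
OPTIONS = ("HSA_AMD_SVM", "DEVICE_PRIVATE", "HMM_MIRROR")


def parse_kernel_config(text, options=OPTIONS):
    """Return {option: value}, where value is e.g. 'y', 'm', or 'unset'."""
    state = {opt: "unset" for opt in options}
    for line in text.splitlines():
        line = line.strip()
        for opt in options:
            # Disabled options appear as "# CONFIG_FOO is not set",
            # so they never match this prefix and stay 'unset'.
            if line.startswith(f"CONFIG_{opt}="):
                state[opt] = line.split("=", 1)[1]
    return state


def read_running_kernel_config():
    """Read the config of the running kernel (assumed path, Debian layout)."""
    path = Path("/boot") / f"config-{platform.release()}"
    return path.read_text()
```

Calling parse_kernel_config(read_running_kernel_config()) on the affected
machine should show HSA_AMD_SVM as 'unset' on a kernel built without it.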

That did remind me of another message I've seen in dmesg, repeated a
few dozen times, when some (but not all) tests are run:

    amdgpu: init_user_pages: Failed to get user pages: -1

rocrand is a good example where these occur.

I did not observe any negative side effects from this failure, but
rebuilding the kernel with the options above did not make the message go
away either. Have you seen this message in dmesg as well?

Best,
Christian

