Re: Building rocRAND with Debian HIP



Hi Étienne,

On 2022-06-15 15:52, Étienne Mollier wrote:
> After triggering the test suite of rocrand, I see most tests
> failing with error messages like the following, e.g. test 23:
>
> 	23: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/test/test_rocrand_xorwow_prng
> 	23: Test timeout computed to be: 10000000
> 	23: Running main() from ./googletest/src/gtest_main.cc
> 	23: [==========] Running 8 tests from 1 test suite.
> 	23: [----------] Global test environment set-up.
> 	23: [----------] 8 tests from rocrand_xorwow_prng_tests
> 	23: [ RUN      ] rocrand_xorwow_prng_tests.init_test
> 	23: LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
> 	23: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
> 	"
> 	23/29 Test #23: test_rocrand_xorwow_prng ............Subprocess aborted***Exception:   0.09 sec
>
> In past discussion, I understood that the libhsa-amd-aqlprofile64.so
> message should be benign (or should otherwise be skipped), so I believe
> the issue mainly results from the hipErrorNoBinaryForGpu.

Correct. The libhsa-amd-aqlprofile64.so warning can be ignored. I saw the same warning last night when I was testing on my Radeon VII, but in my case all tests passed.

The hipErrorNoBinaryForGpu error is the problem causing your test failure. There are a number of 'code objects' (ELF files) embedded in librocrand.so. IIRC, when you launch a kernel, the HIP AMD runtime will load the corresponding code objects for that function for each AMD GPU installed in your system. If it can't find one (e.g., because the library was built for gfx906 but your hardware is gfx803), then you will encounter that error.

It would be nice if HIP printed more information by default, but you can get it to emit more details about what it was looking for by setting the AMD_LOG_LEVEL environment variable [1]. Something like export AMD_LOG_LEVEL=4 should do the trick.
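As a rough sketch of how that might look in practice, the snippet below sets the variable and reruns one failing test, saving the output to a file. The TEST_BIN path and the hip-debug.log file name are placeholders I made up for illustration; adjust them to your build tree.

```shell
# Raise the HIP runtime log level for this shell session
# (4 is the value suggested above).
export AMD_LOG_LEVEL=4

# TEST_BIN is a placeholder; point it at a test binary in your own build tree.
TEST_BIN="${TEST_BIN:-./test/test_rocrand_xorwow_prng}"

if [ -x "$TEST_BIN" ]; then
    # Capture both stdout and stderr so the extra log lines are kept
    # alongside the gtest output.
    "$TEST_BIN" 2>&1 | tee hip-debug.log
else
    echo "test binary not found: $TEST_BIN" >&2
fi
```

Prefixing a single command with AMD_LOG_LEVEL=4 instead of exporting it works too, if you would rather not leave the variable set for the rest of the session.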

You can also use roc-obj and roc-obj-ls to inspect the code objects within a binary [2].
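For instance, something along these lines should list the code objects bundled into the library, so you can see which gfx targets it was actually built for. The library path is only a guess at where a Debian package might put it, not something I have checked; substitute the real location.

```shell
# LIB is an assumed location; replace it with the actual path to librocrand.so.
LIB="${LIB:-/usr/lib/x86_64-linux-gnu/librocrand.so}"

if command -v roc-obj-ls >/dev/null 2>&1 && [ -e "$LIB" ]; then
    # Lists the embedded code objects; the target each one was compiled
    # for is part of every entry.
    roc-obj-ls "$LIB"
else
    echo "roc-obj-ls or $LIB not available" >&2
fi
```

If none of the listed targets matches the GPU in the machine, that would be consistent with the hipErrorNoBinaryForGpu failure.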

> I tried various things in rocrand and rocm-hipamd to attempt to
> enable the gfx803 architecture, as I was under the impression
> that the existing packaging was mainly targeting gfx906, but
> subsequent build attempts failed with:
>
> 	clang: error: invalid target ID 'gfx803:xnack-'; format is a processor name followed by an optional colon-delimited list of features followed by an enable/disable sign (e.g., 'gfx908:sramecc+:xnack-')
>
> which sounds rather odd since the format specified in the
> --offload-arch argument looks to match the textual
> specification.

The gfx803 architecture doesn't support the xnack feature, so it's an error to include xnack in its target ID; plain gfx803 is the valid form. You can see which architectures support xnack or sramecc by checking the AMDGPU Processors table in the LLVM documentation [3].

To compile for gfx803, the CMake argument -DAMDGPU_TARGETS=gfx803 is sufficient. If left unset, roc{RAND,PRIM,BLAS,SOLVER,SPARSE,FFT} will compile for a default set of architectures. In the case of rocRAND, that would be gfx803, gfx900:xnack-, gfx906:xnack-, gfx908:xnack-, gfx90a:xnack-, gfx90a:xnack+ and gfx1030 [4]. Those are the architectures that are used for the binaries that AMD distributes directly.
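A minimal sketch of a configure and build step with that argument is below. The build directory name is arbitrary, and I am assuming the usual CMake convention of separating multiple targets with semicolons; any compiler or toolchain settings your packaging needs are left out.

```shell
# Target list to build for. Note there is no xnack suffix on gfx803.
# Assumption: several targets would be given as a semicolon-separated
# CMake list, e.g. "gfx803;gfx906:xnack-".
AMDGPU_TARGETS="${AMDGPU_TARGETS:-gfx803}"

# Run from the top of the rocRAND source tree; "build" is a placeholder name.
if [ -f CMakeLists.txt ] && command -v cmake >/dev/null 2>&1; then
    cmake -S . -B build -DAMDGPU_TARGETS="$AMDGPU_TARGETS"
    cmake --build build
fi
```

Remember to quote the value when it contains semicolons, otherwise the shell will treat them as command separators.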

Sincerely,
Cory Bloor

[1]: https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-5.0.2/docs/markdown/hip_logging.md#hip-logging-level
[2]: https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-5.0.2/docs/markdown/obj_tooling.md
[3]: https://llvm.org/docs/AMDGPUUsage.html#processors
[4]: https://salsa.debian.org/rocm-team/rocrand/-/blob/487281d66e850c9cc9c8a2dbe60fdca3e29c98a5/CMakeLists.txt#L95

