
Re: Question about how to handle HIP vs hipamd



Hi there,

Jeremy Newton, on 2022-05-17:
> I assume the Debian package excludes .hipInfo and .hipVersion? I was looking into where to move those files, but I was more focused on the patches for the ROCclr and HIP, since those are bigger blockers for Fedora.

They are normally still around during build-time testing, but
yes, they are not part of any binary package for the moment.

> On May 17, 2022 12:18:18 a.m. EDT, Cordell Bloor <cgmb-deb@slerp.xyz> wrote:
>>It's a beautiful day, Étienne.

I'm glad to read you had such a day.  :)

>>It seems the problem is the HIP version detection. Without a .hipInfo file, clang doesn't know what version of HIP it is using. It skips the wrapper because it thinks that HIP is too old. The simplest fix for this is probably to explicitly pass --hip-version=5.0.0 to clang. I used the HIPCC_COMPILE_FLAGS_APPEND and HIPCC_LINK_FLAGS_APPEND environment variables for that purpose.

Thanks for the tip!  I injected these arguments in the build
procedure and it went through.
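
For reference, here is a minimal sketch of what such an injection
could look like.  It assumes the variables are simply exported in
the environment that hipcc runs in, and that 5.0.0 matches the
HIP release being built; where exactly this belongs in the
packaging is left open.

```shell
# Tell clang which HIP version it is targeting, since no .hipInfo
# file is available for it to detect the version on its own.
# Assumption: 5.0.0 is the HIP release being built.
export HIPCC_COMPILE_FLAGS_APPEND="--hip-version=5.0.0"
export HIPCC_LINK_FLAGS_APPEND="--hip-version=5.0.0"
```

hipcc appends the contents of these two variables to its compile
and link command lines respectively.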

>>I encountered one final linking error in a test and resolved it by installing librocmsmi64-1. Then, I had a successful build!

I hit a couple of minor issues in the packaging of the Debian
librocm-smi-dev binary package, so I stalled a wee bit longer
than necessary before finishing the build, but that is worked
around now.

>>...though, I haven't actually tried running the tests on a machine with a GPU.

I tried to run the test suite in an schroot exposing GPU devices
and the result was… interesting.  I ran a backup of the host,
then started the test suite, and after some time the system
ended up stuck, with some CPU cores reporting hard lockups.
After a couple of kernel traces, I ended up doing a hard reset
of the host.  To my surprise, the build and test log reached the
end and reported the following, which does not look too bad:

	92% tests passed, 32 tests failed out of 408
	
	Total Test time (real) = 3097.66 sec
	
	The following tests did not run:
		 96 - directed_tests/g++/hipMalloc_cxx_amd.tst (Skipped)
	
	The following tests FAILED:
		 99 - directed_tests/hiprtc/hiprtcGetLoweredName.tst (SEGFAULT)
		100 - directed_tests/hiprtc/saxpy.tst (SEGFAULT)
		101 - directed_tests/ipc/hipMultiProcIpcEvent.tst (Timeout)
		102 - directed_tests/ipc/hipMultiProcIpcMem.tst (Subprocess aborted)
		111 - directed_tests/kernel/hipPrintfKernel.tst (Timeout)
		113 - directed_tests/kernel/hipShflUpDownTest.tst (Timeout)
		121 - directed_tests/printf/hipPrintfAltForms.tst (Timeout)
		122 - directed_tests/printf/hipPrintfBasic.tst (Timeout)
		123 - directed_tests/printf/hipPrintfFlags.tst (Timeout)
		124 - directed_tests/printf/hipPrintfManyDevices.tst (Timeout)
		125 - directed_tests/printf/hipPrintfManyWaves.tst (Timeout)
		126 - directed_tests/printf/hipPrintfSpecifiers.tst (Timeout)
		127 - directed_tests/printf/hipPrintfStar.tst (Timeout)
		128 - directed_tests/printf/hipPrintfWidthPrecision.tst (Timeout)
		197 - directed_tests/runtimeApi/memory/hipIpcMemAccessTest.tst (Timeout)
		249 - directed_tests/runtimeApi/memory/hipMemcpyNegativeMThrdMSize_MultiSize_singleType.tst (Timeout)
		281 - directed_tests/runtimeApi/memory/hipMemsetAsyncAndKernel.tst (Timeout)
		282 - directed_tests/runtimeApi/memory/hipMemsetAsyncMultiThread.tst (Timeout)
		296 - directed_tests/runtimeApi/module/hipExtModuleLaunchKernel_CornerScenarios.tst (Timeout)
		307 - directed_tests/runtimeApi/module/hipModuleLaunchKernel--tests0x2.tst (Timeout)
		380 - directed_tests/runtimeApi/stream/hipMultiStreams.tst (Timeout)
		381 - directed_tests/runtimeApi/stream/hipNullStream.tst (Timeout)
		382 - directed_tests/runtimeApi/stream/hipStreamACb_AltEnqueue.tst (Timeout)
		386 - directed_tests/runtimeApi/stream/hipStreamACb_StrmSyncTiming.tst (Subprocess aborted)
		394 - directed_tests/runtimeApi/stream/hipStreamGetPriority.tst (Timeout)
		395 - directed_tests/runtimeApi/stream/hipStreamL5.tst (Timeout)
		396 - directed_tests/runtimeApi/stream/hipStreamSync2.tst (Subprocess aborted)
		397 - directed_tests/runtimeApi/stream/hipStreamWithCUMask.tst (Subprocess aborted)
		398 - directed_tests/runtimeApi/streamOperations/hipstream_operations.tst (Subprocess aborted)
		399 - directed_tests/runtimeApi/synchronization/cache_coherency_cpu_gpu.tst (Subprocess aborted)
		400 - directed_tests/runtimeApi/synchronization/cache_coherency_gpu_gpu.tst (Subprocess aborted)
		401 - directed_tests/runtimeApi/synchronization/copy_coherency.tst (Subprocess aborted)

Most of the failures are timeouts, which may not mean much.  The
tests which aborted might have been running at the time the
machine was beginning to crash.  The two segmentation faults in
tests #99 and #100 might be of concern.  I caught the following
around these, so maybe there is a library which needs to get
packaged in Debian too:

	99: LoadLib(libhsa-amd-aqlprofile64.so) failed: libhsa-amd-aqlprofile64.so: cannot open shared object file: No such file or directory
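
To get a quick tally of the failures per reason, something along
these lines can be fed the ctest summary.  This is only a
sketch: tally_failures is a made-up helper name, and it assumes
the summary lines keep the "NNN - name.tst (Reason)" shape shown
in the log excerpt.

```shell
# tally_failures: read a ctest summary on stdin and print the
# number of tests per failure reason, most frequent first.
# Skipped tests are ignored, since they did not fail.
tally_failures() {
    # Keep only the text between the last pair of parentheses on
    # lines shaped like "  99 - some/test.tst (Reason)".
    sed -n 's/^[[:space:]]*[0-9][0-9]* - .*(\(.*\))$/\1/p' |
        awk '$0 != "Skipped" { n[$0]++ }
             END { for (r in n) print n[r], r }' |
        sort -k1,1nr -k2
}
```

Usage would be along the lines of "tally_failures < summary.txt",
where summary.txt is a hypothetical file holding the end of the
build log.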

Without an AMD GPU exposed to the build environment, tests
mostly fail due to invalid device.  I am considering building
the test suite but not running it when no GPU is present, which
would be the ordinary buildd setup.
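
A rough sketch of such a guard follows.  It assumes that the
presence of the /dev/kfd device node is a good enough sign that
an AMD GPU is exposed to the build environment; has_amd_gpu and
run_tests_if_gpu are made-up names, and the ctest invocation is
only a placeholder for whatever the packaging ends up calling.

```shell
# has_amd_gpu: succeed if the compute device node exists.
# The optional argument overrides the path, for experimenting;
# the default /dev/kfd is the node provided by the amdgpu driver.
has_amd_gpu() {
    [ -e "${1:-/dev/kfd}" ]
}

# run_tests_if_gpu: run the already built test suite only when a
# GPU appears to be available, otherwise just say so and move on.
run_tests_if_gpu() {
    if has_amd_gpu "$1"; then
        ctest --output-on-failure
    else
        echo "no AMD GPU exposed, skipping test run"
    fi
}
```

On an ordinary buildd, calling run_tests_if_gpu would then only
print the skip message, while the test suite itself still gets
built.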

I stick to the Debian native Linux kernel and its integrated
amdgpu driver, for what it's worth:

	$ uname -srv
	Linux 5.17.0-2-amd64 #1 SMP PREEMPT Debian 5.17.6-1 (2022-05-11)

I attached the build log with the RX560 (gfx803), and extracts
of the kernel log, mostly for the curious.

Have a nice day,  :)
-- 
Étienne Mollier <emollier@emlwks999.eu>
Fingerprint:  8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
Sent from /dev/pts/2, please excuse my verbosity.
On air: Jonas Lindberg & The Other Side - Miles From Nowhere, Pt

Attachment: amdgpu.kern.log.xz
Description: application/xz

Attachment: rocm-hipamd_5.0.0-1~exp1_amd64_gfx803.build.xz
Description: application/xz

Attachment: signature.asc
Description: PGP signature