
Bug#1028957: librocrand-dev: rocrand_INCLUDE_DIR does not exist



Hi Cory,

Cordell Bloor, on 2023-01-18:
> I've updated the rocrand package sources on Salsa to rocrand 5.3.3 and
> transformed it into a MUT package. I've confirmed that the resulting library
> works correctly using it to configure rocfft 5.4.2 (which was how I
> discovered this bug originally).

Thanks for this.  I began doing the same yesterday, but shifted
my attention to rocm-hipamd for the reason you mention below.

> rocrand 5.3.3-1 just needs three things to be ready for upload:
> 
> 1. The d/copyright file needs to be updated for the new version.

Acknowledged.  Note for later: in particular, the inclusion of
the hipRAND directory needs to be checked.  (I'll be travelling
this weekend, so I won't be very responsive until next week.)

> 2. The symbol tracking needs to be reviewed by somebody more experienced
> than me. I think that anything in the rocrand::detail or
> rocrand_host::detail namespace should be marked optional, as those symbols
> are not intended for use by library users.

Ideally they should not be exposed at all (the -fvisibility=hidden
build flag provides a means for that, but to be honest I'm not
sure of the implementation details on the upstream side).  If the
symbols are not part of the public interface but are still
referenced, and we are sure they are unused by reverse
dependencies, they can probably be marked (optional).  The library
soversion suggests the stable part of the ABI should not have been
broken, so I guess the (optional) marker is fine.
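
As a rough illustration of the (optional) tagging discussed above, here
is a minimal Python sketch that prefixes entries of a symbols file with
the (optional) tag when their mangled name lives in rocrand::detail or
rocrand_host::detail.  The helper name, the sample entries (including
the "helper" function and the package name in the header line) and the
prefix-based matching are assumptions made for illustration, not
something taken from the actual package:

```python
import re

# Itanium-mangled prefixes for functions in rocrand::detail and
# rocrand_host::detail ("NK" covers const member functions).
# Assumption: prefix matching is enough for a first pass; it does not
# catch vtables, typeinfo, or symbols that merely take detail types
# as arguments.
DETAIL_PREFIX = re.compile(r"^_ZNK?(?:7rocrand|12rocrand_host)6detail")


def mark_detail_optional(lines):
    """Return symbols-file lines with detail-namespace symbols tagged (optional)."""
    result = []
    for line in lines:
        # Symbol entries start with a single space; header lines and
        # entries that already carry a tag are passed through unchanged.
        if line.startswith(" ") and not line.startswith(" ("):
            symbol = line[1:]
            if DETAIL_PREFIX.match(symbol):
                line = " (optional)" + symbol
        result.append(line)
    return result


# Hypothetical sample input, for illustration only.
sample = [
    "librocrand.so.1 librocrand1 #MINVER#",
    " rocrand_create_generator@Base 5.3.3",
    " _ZN7rocrand6detail6helperEv@Base 5.3.3",
    " _ZN12rocrand_host6detail6helperEv@Base 5.3.3",
]

for line in mark_detail_optional(sample):
    print(line)
```

The result would still need a manual review, since a symbol that is
really used by a reverse dependency must not be tagged this way.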

> 3. The rocm-hipamd 5.2.3-3 package needs to be uploaded or
> libclang-rt-15-dev must be added to the rocrand build dependencies.

I wanted to take that opportunity to stabilize the test suite of
rocm-hipamd, but it is currently failing for me on:

	test 103
	        Start 103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
	
	103: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/ipc/hipMultiProcIpcMem " " "--N" "4"
	103: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
	103: Environment variables: 
	103:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
	103: Test timeout computed to be: 1500
	103: KFD does not support xnack mode query.
	103: ROCr must assume xnack is disabled.
	103: error: 'hipErrorInvalidDevicePointer'(17) from hipIpcGetMemHandle(&ipc_handle, ipc_offset_dptr) at /<<PKGBUILDDIR>>/hip/tests/src/ipc/hipMultiProcIpcMem.cpp:55
	103: error: API returned error code.
	103: error: TEST FAILED
	103: 
	103/414 Test #103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst .......................................................................................Subprocess aborted***Exception: 792.07 sec

A later test then crashes:

	test 126
	        Start 126: directed_tests/printf/hipPrintfManyWaves.tst
	
	126: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/printf/hipPrintfManyWaves " "
	126: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
	126: Environment variables: 
	126:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
	126: Test timeout computed to be: 1500
	126: KFD does not support xnack mode query.
	126: ROCr must assume xnack is disabled.
	126: Memory access fault by GPU node-1 (Agent handle: 0x562e8fbb1e00) on address (nil)(may not be exact address). Reason: DRAM ECC failure.
	126: Nearby memory map:
	126: 0x7f5497800000, 0x78c000, System
	126: 0x7f549ac00000, 0x100000, VRAM
	126: 0x7f549af00000, 0x80000, System
	126: 
	126: PtrInfo:
	126:    Address: 0x7f5497800000-0x7f5497f8c000/0x7f5497800000-0x7f5497f8c000
	126:    Size: 0x78c000
	126:    Type: 1
	126:    Owner: 0x562e8fbac4b0
	126:    CanAccess: 1
	126:            0x562e8fbb1e00
	126:    In block: 0x7f5497800000, 0x78c000
	126: PtrInfo:
	126:    Address: 0x7f549ac00000-0x7f549ad00000/0x7f549ac00000-0x7f549ad00000
	126:    Size: 0x100000
	126:    Type: 1
	126:    Owner: 0x562e8fbb1e00
	126:    CanAccess: 1
	126:            0x562e8fbb1e00
	126:    In block: 0x7f549ac00000, 0x200000
	126: PtrInfo:
	126:    Address: 0x7f549af00000-0x7f549af80000/0x7f549af00000-0x7f549af80000
	126:    Size: 0x80000
	126:    Type: 1
	126:    Owner: 0x562e8fbac4b0
	126:    CanAccess: 1
	126:            0x562e8fbb1e00
	126:    In block: 0x7f549af00000, 0x80000
	126: hipPrintfManyWaves: ./src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
	126/414 Test #126: directed_tests/printf/hipPrintfManyWaves.tst ........................................................................................Subprocess aborted***Exception:   0.64 sec

At about the same time as test 126, I get a kernel NULL pointer
dereference:

	amdgpu: sq_intr: error, se 2, data 0x25, sh 0, priv 0, wave_id 0, simd_id 0, cu_id 0, err_type 4
	amdgpu 0000:0b:00.0: amdgpu: RAS poison consumption, unmap queue flow succeeded: client id 10
	BUG: kernel NULL pointer dereference, address: 00000000000001b0
	#PF: supervisor write access in kernel mode
	#PF: error_code(0x0002) - not-present page
	PGD 0 P4D 0 
	Oops: 0002 [#1] PREEMPT SMP NOPTI
	CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 6.1.0-1-amd64 #1  Debian 6.1.4-1
	Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F3 09/04/2019
	Workqueue: KFD IH interrupt_wq [amdgpu]
	RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
	Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
	RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
	RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
	RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
	RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
	R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
	R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
	FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
	Call Trace:
	 <TASK>
	 smu_get_ecc_info+0x1f/0x30 [amdgpu]
	 amdgpu_dpm_get_ecc_info+0x39/0x60 [amdgpu]
	 amdgpu_umc_do_page_retirement.constprop.0+0x38/0x170 [amdgpu]
	 amdgpu_umc_poison_handler+0x64/0xb0 [amdgpu]
	 amdgpu_amdkfd_ras_poison_consumption_handler+0x48/0x70 [amdgpu]
	 interrupt_wq+0xcf/0x120 [amdgpu]
	 process_one_work+0x1c7/0x380
	 worker_thread+0x4d/0x380
	 ? _raw_spin_lock_irqsave+0x23/0x50
	 ? rescuer_thread+0x3a0/0x3a0
	 kthread+0xe9/0x110
	 ? kthread_complete_and_exit+0x20/0x20
	 ret_from_fork+0x22/0x30
	 </TASK>
	Modules linked in: overlay cpufreq_userspace cpufreq_powersave cpufreq_ondemand cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd snd_hda_codec_realtek kvm snd_hda_codec_generic ledtrig_audio irqbypass snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 gpu_sched snd_hda_intel sha512_generic snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi video snd_hda_codec drm_display_helper snd_hda_core cec rc_core snd_hwdep aesni_intel snd_pcm drm_ttm_helper crypto_simd ttm cryptd snd_timer drm_kms_helper gigabyte_wmi rapl snd pcspkr ccp wmi_bmof i2c_algo_bit sp5100_tco watchdog k10temp soundcore rng_core evdev button acpi_cpufreq sg parport_pc ppdev lp drm parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs btrfs zstd_compress raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod xor raid6_pq libcrc32c crc32c_generic dm_mod sd_mod hid_generic usbhid hid ahci nvme
	 xhci_pci libahci xhci_hcd nvme_core libata r8169 t10_pi realtek mdio_devres crc32_pclmul crc64_rocksoft crc32c_intel crc64 usbcore libphy crc_t10dif scsi_mod i2c_piix4 crct10dif_generic scsi_common usb_common crct10dif_pclmul crct10dif_common wmi
	CR2: 00000000000001b0
	---[ end trace 0000000000000000 ]---
	RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
	Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
	RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
	RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
	RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
	RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
	R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
	R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
	FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0

I'm redoing a build without running the test suite for upload,
but I had to forcefully reboot the workstation, so this doesn't
feel ideal for now.  Thankfully this doesn't seem to affect the
test suites of reverse dependencies, as far as I have seen so
far.  The kernel version, for future reference:

	$ uname -srv
	Linux 6.1.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 (2023-01-07)

Have a nice day,  :)
-- 
Étienne Mollier <emollier@emlwks999.eu>
Fingerprint:  8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
Sent from /dev/tty1, please excuse my verbosity.
On air: Spock's Beard - When She's Gone
