On 2023-07-11 23:04, Étienne Mollier wrote:
> Thanks for checking! So far, I think I isolated
> test_thrust_set_difference, as it very much stresses the gpu,
> but I haven't seen it finish in autopkgtest context yet.

I can confirm that test_thrust_set_difference also hangs on my end with
an RX 6800 XT, not just when driven by autopkgtest with the QEMU backend
but also on bare metal.
The test hangs with one CPU core at 100% load. Here is the backtrace:
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x55b5c6c88800, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1,
timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_ACTIVE) at ./src/core/runtime/interrupt_signal.cpp:198
#1 0x00007f0c6ca577ea in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>,
compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>)
at ./src/core/runtime/interrupt_signal.cpp:220
#2 0x00007f0c6ca4d5a7 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT,
compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_ACTIVE)
at ./src/core/runtime/hsa.cpp:1219
#3 0x00007f0c6d56d8bb in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#4 0x00007f0c6d57283e in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#5 0x00007f0c6d5a755b in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#6 0x00007f0c6d56fc3f in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#7 0x00007f0c6d53b4c6 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#8 0x00007f0c6d3beba3 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#9 0x00007f0c6d3c1fd8 in hipMemcpyWithStream () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#10 0x000055b5c3f6875d in ?? ()
#11 0x000055b5c3f3211f in ?? ()
#12 0x000055b5c3f2aed0 in ?? ()
#13 0x000055b5c3f5d5ff in ?? ()
#14 0x000055b5c3f6b615 in ?? ()
#15 0x000055b5c3f6c6cb in ?? ()
#16 0x000055b5c3f72079 in ?? ()
#17 0x000055b5c3f6cd0c in ?? ()
#18 0x00007f0c6d04618a in __libc_start_call_main (main=main@entry=0x55b5c3f6c800, argc=argc@entry=1,
argv=argv@entry=0x7fff74c21ef8) at ../sysdeps/nptl/libc_start_call_main.h:58
#19 0x00007f0c6d046245 in __libc_start_main_impl (main=0x55b5c3f6c800, argc=1, argv=0x7fff74c21ef8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff74c21ee8) at ../csu/libc-start.c:381
#20 0x000055b5c3f29921 in ?? ()
I think this is a particularly interesting case because we need to
figure out how to deal with hangs like this in our infra.
Some suites need hours to run, but if we set the general timeout to
three hours or so, a hung test like the one above will simply block for
three hours. One way to solve this is by extending autopkgtest to honor
a package-specific timeout hint in debian/control
(X-Autopkgtest-Timeout: 30m or similar).
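
To make the idea a bit more concrete, here is a rough Python sketch of
how a runner could turn such a hint into a per-package limit. This is
only an illustration under assumptions: the X-Autopkgtest-Timeout field
and the "30m" syntax are just the proposal above, and parse_timeout and
run_with_timeout are made-up helper names, not existing autopkgtest
code.

```python
import re
import subprocess
import sys

# Seconds per unit suffix; a bare number is treated as seconds (assumption).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(hint):
    """Convert a hint such as '30m', '3h' or '45' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", hint)
    if match is None:
        raise ValueError(f"unsupported timeout hint: {hint!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit or "s"]


def run_with_timeout(cmd, hint=None, default="3h"):
    """Run cmd, using the package hint if present, else the general default.

    Returns the exit code, or None if the command hit the timeout.
    """
    limit = parse_timeout(hint if hint is not None else default)
    try:
        # subprocess.run kills the child and raises TimeoutExpired when
        # the limit is exceeded.
        return subprocess.run(cmd, timeout=limit).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Hypothetical stand-in for a hanging test: sleeps longer than the hint.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hung, hint="1s"))
```

A hung test then costs only its own limit instead of the general one.
Note that subprocess.run only kills the direct child, so a real
implementation would probably need to clean up the whole process group
(or the testbed).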