
Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests



Hi Étienne and Christian,

I'm a bit unclear on our policy with these sorts of issues. Should this be blocking the entry of rocthrust-tests into experimental? The tests are failing (and in a bad way), but they're just the messenger. They exercised the system and revealed a preexisting problem. To me, that seems valuable to end-users. It's a lot faster to discover that some rocthrust function will cause problems on your GPU by installing and running the test suite vs. being half-way through writing your own program using librocthrust-dev and discovering it via your own code.

With that said, I do see the other side. All software has latent bugs and surfacing them is a trade-off. A bug that may reset your GPU does seem rather serious, and making it easy to invoke that bug might not be desirable.

On 7/11/23 15:04, Étienne Mollier wrote:
Good point, I'm wrapping up a bug report for the distribution
kernel.  It's not sent yet as I'd like to run a few more tests
to complete my report.

I asked the developers of amdgpu/kfd for some tips on writing a good report.

Their consensus was that the two most important things are to provide a reproducer (i.e., a clearly described method to reproduce the problem) and the full dmesg log (as it contains lots of info like kernel version, VBIOS version, etc).
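As a rough illustration of that advice, here is a minimal Python sketch that bundles a written reproducer with the unabridged kernel log into one file. The function names and the output file name are made up for this example; the only external tool it relies on is `dmesg`, which may require root when `kernel.dmesg_restrict` is enabled.

```python
import subprocess


def read_dmesg() -> str:
    """Return the full kernel log (may need root on restricted systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def build_report(reproducer: str, dmesg_text: str) -> str:
    """Combine the reproducer steps and the complete dmesg log into one text."""
    return (
        "Steps to reproduce:\n"
        f"{reproducer.strip()}\n\n"
        "Full dmesg log:\n"
        f"{dmesg_text.rstrip()}\n"
    )
```

A report could then be written with something like `open("amdgpu-report.txt", "w").write(build_report(steps, read_dmesg()))`, where `steps` holds the exact commands that trigger the hang. The log is deliberately not filtered, since the kernel and VBIOS versions appear early in it.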

On 7/11/23 16:10, Christian Kastner wrote:
On 2023-07-11 23:04, Étienne Mollier wrote:
Thanks for checking!  So far, I think I isolated
test_thrust_set_difference, as it very much stresses the gpu,
but I haven't seen it finish in autopkgtest context yet.
I can confirm that test_thrust_set_difference also hangs on my end with
an RX 6800 XT, not just when driven by autopkgtest with the QEMU backend
but also on bare metal.

The test hangs with one CPU core at 100% load, here is the backtrace:

#0  rocr::core::InterruptSignal::WaitRelaxed (this=0x55b5c6c88800, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1,
     timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_ACTIVE) at ./src/core/runtime/interrupt_signal.cpp:198
#1  0x00007f0c6ca577ea in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>,
     compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>)
     at ./src/core/runtime/interrupt_signal.cpp:220
#2  0x00007f0c6ca4d5a7 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT,
     compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_ACTIVE)
     at ./src/core/runtime/hsa.cpp:1219
#3  0x00007f0c6d56d8bb in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#4  0x00007f0c6d57283e in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#5  0x00007f0c6d5a755b in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#6  0x00007f0c6d56fc3f in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#7  0x00007f0c6d53b4c6 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#8  0x00007f0c6d3beba3 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#9  0x00007f0c6d3c1fd8 in hipMemcpyWithStream () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#10 0x000055b5c3f6875d in ?? ()
#11 0x000055b5c3f3211f in ?? ()
#12 0x000055b5c3f2aed0 in ?? ()
#13 0x000055b5c3f5d5ff in ?? ()
#14 0x000055b5c3f6b615 in ?? ()
#15 0x000055b5c3f6c6cb in ?? ()
#16 0x000055b5c3f72079 in ?? ()
#17 0x000055b5c3f6cd0c in ?? ()
#18 0x00007f0c6d04618a in __libc_start_call_main (main=main@entry=0x55b5c3f6c800, argc=argc@entry=1,
     argv=argv@entry=0x7fff74c21ef8) at ../sysdeps/nptl/libc_start_call_main.h:58
#19 0x00007f0c6d046245 in __libc_start_main_impl (main=0x55b5c3f6c800, argc=1, argv=0x7fff74c21ef8, init=<optimized out>,
     fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff74c21ee8) at ../csu/libc-start.c:381
#20 0x000055b5c3f29921 in ?? ()

It's a shame that the debug symbols for libamdhip64 are not available, but the problem seems to be lower in the stack anyway. As a side note, there are a number of environment variables documented in the HIP Debugging Guide [1] that can be useful for debugging failures.
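For illustration, a minimal Python sketch of preparing such an environment for a test run. `AMD_LOG_LEVEL`, `AMD_SERIALIZE_KERNEL` and `AMD_SERIALIZE_COPY` are variables described in the HIP debugging guide, but the values chosen here are assumptions and should be checked against the guide for the installed ROCm version. The helper name `hip_debug_env` is hypothetical.

```python
import os


def hip_debug_env(base=None, log_level=4):
    """Return a copy of the environment with HIP debugging variables set."""
    env = dict(os.environ if base is None else base)
    # Verbose runtime logging (assumed: higher values mean more detail).
    env["AMD_LOG_LEVEL"] = str(log_level)
    # Assumed: 3 waits for completion both before and after each enqueue,
    # which makes it easier to see which operation never finishes.
    env["AMD_SERIALIZE_KERNEL"] = "3"
    env["AMD_SERIALIZE_COPY"] = "3"
    return env
```

The result can be passed as `env=hip_debug_env()` to `subprocess.run` when launching the failing test binary, leaving the parent environment untouched.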

I think this is a particularly interesting case because we need to
figure out how to deal with such a case in our infra.

Some suites need hours to run, but if we set the general timeout to
three hours or so, that test above will just block three hours. One way
to solve this is by extending autopkgtest to honor a package-specific
timeout hint in debian/control (X-Autopkgtest-Timeout: 30m or similar).
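To make the idea concrete, here is a minimal Python sketch of how a hint such as `30m` could be parsed and enforced. This is not how autopkgtest works today: the `X-Autopkgtest-Timeout` field is only a proposal in this thread, and the accepted suffixes (`s`, `m`, `h`) and both function names are assumptions made for the example.

```python
import re
import subprocess

# Assumed unit suffixes for the hypothetical timeout hint.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(hint: str) -> int:
    """Convert a hint like '30m' or '3h' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh])\s*", hint)
    if match is None:
        raise ValueError(f"unrecognized timeout hint: {hint!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def run_with_timeout(cmd, hint):
    """Run cmd; return its exit code, or None if it exceeded the hint."""
    try:
        return subprocess.run(cmd, timeout=parse_timeout(hint)).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None
```

With a per-package limit like this, a hanging test would be cut off after its own limit instead of blocking for the full general timeout.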

I've been talking to the recently formed Ubuntu HPC Team [2] on their Matrix channel and they seem interested in testing ROCm packages for Ubuntu. I don't know enough about the Debian test infrastructure to be much of a go-between on this, but Jason Nucciarone is the primary developer who expressed interest. It might be worth reaching out.

Sincerely,
Cory Bloor

[1]: https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html
[2]: https://discourse.ubuntu.com/t/high-performance-computing-team/35988

