Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests

To: debian-ai@lists.debian.org
Subject: Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Wed, 12 Jul 2023 14:53:08 -0600
Message-id: <86efbb71-4873-ce90-b30e-2e5c35071c7e@slerp.xyz>
In-reply-to: <7a233cda-151c-9361-b370-b81e585d7acc@debian.org>
References: <b4ae47a7-6bd7-d995-025d-5427f4c540fa@slerp.xyz> <ZKsb9lGZasmQ/U7M@fusion> <666d12f2-d193-5617-e8f1-eb5650c944c1@slerp.xyz> <ZK3D0J7Yzm2f3CVs@fusion> <7a233cda-151c-9361-b370-b81e585d7acc@debian.org>

Hi Étienne and Christian,

I'm a bit unclear on our policy with these sorts of issues. Should this be blocking the entry of rocthrust-tests into experimental? The tests are failing (and in a bad way), but they're just the messenger. They exercised the system and revealed a preexisting problem. To me, that seems valuable to end-users. It's a lot faster to discover that some rocthrust function will cause problems on your GPU by installing and running the test suite vs. being half-way through writing your own program using librocthrust-dev and discovering it via your own code.

With that said, I do see the other side. All software has latent bugs and surfacing them is a trade-off. A bug that may reset your GPU does seem rather serious, and making it easy to invoke that bug might not be desirable.


On 7/11/23 15:04, Étienne Mollier wrote:

Good point, I'm wrapping up a bug report for the distribution
kernel.  It's not sent yet as I'd like to run a few more tests
to complete my report.


I asked the developers of amdgpu/kfd for some tips on writing a good report.

Their consensus was that the two most important things are to provide a reproducer (i.e., a clearly described method to reproduce the problem) and the full dmesg log (as it contains lots of info like kernel version, VBIOS version, etc).
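
As a rough sketch of how the log could be captured right after running a reproducer, a small Python helper along these lines might do. The helper name save_output and the file name dmesg-full.log are illustrative assumptions, not something from this thread.

```python
import subprocess
from pathlib import Path


def save_output(argv, path):
    """Run argv and write its combined stdout/stderr to path.

    Returns the command's exit code. Output is kept as raw bytes so that
    odd characters in the log cannot cause a decoding error.
    """
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    Path(path).write_bytes(result.stdout)
    return result.returncode


# Example (hypothetical file name). Reading the kernel log needs extra
# privileges on systems where kernel.dmesg_restrict is enabled:
# save_output(["dmesg"], "dmesg-full.log")
```

Attaching the whole file, rather than an excerpt around the failure, keeps the version information near the top of the log.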


On 7/11/23 16:10, Christian Kastner wrote:

On 2023-07-11 23:04, Étienne Mollier wrote:

Thanks for checking!  So far, I think I isolated
test_thrust_set_difference, as it very much stresses the gpu,
but I haven't seen it finish in autopkgtest context yet.

I can confirm that test_thrust_set_difference also hangs on my end with
an RX 6800 XT, not just when driven by autopkgtest with the QEMU backend
but also on bare metal.

The test hangs with one CPU core at 100% load. Here is the backtrace:

#0  rocr::core::InterruptSignal::WaitRelaxed (this=0x55b5c6c88800, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1,
     timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_ACTIVE) at ./src/core/runtime/interrupt_signal.cpp:198
#1  0x00007f0c6ca577ea in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>,
     compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>)
     at ./src/core/runtime/interrupt_signal.cpp:220
#2  0x00007f0c6ca4d5a7 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT,
     compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_ACTIVE)
     at ./src/core/runtime/hsa.cpp:1219
#3  0x00007f0c6d56d8bb in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#4  0x00007f0c6d57283e in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#5  0x00007f0c6d5a755b in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#6  0x00007f0c6d56fc3f in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#7  0x00007f0c6d53b4c6 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#8  0x00007f0c6d3beba3 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#9  0x00007f0c6d3c1fd8 in hipMemcpyWithStream () from /lib/x86_64-linux-gnu/libamdhip64.so.5
#10 0x000055b5c3f6875d in ?? ()
#11 0x000055b5c3f3211f in ?? ()
#12 0x000055b5c3f2aed0 in ?? ()
#13 0x000055b5c3f5d5ff in ?? ()
#14 0x000055b5c3f6b615 in ?? ()
#15 0x000055b5c3f6c6cb in ?? ()
#16 0x000055b5c3f72079 in ?? ()
#17 0x000055b5c3f6cd0c in ?? ()
#18 0x00007f0c6d04618a in __libc_start_call_main (main=main@entry=0x55b5c3f6c800, argc=argc@entry=1,
     argv=argv@entry=0x7fff74c21ef8) at ../sysdeps/nptl/libc_start_call_main.h:58
#19 0x00007f0c6d046245 in __libc_start_main_impl (main=0x55b5c3f6c800, argc=1, argv=0x7fff74c21ef8, init=<optimized out>,
     fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff74c21ee8) at ../csu/libc-start.c:381
#20 0x000055b5c3f29921 in ?? ()

It's a shame that the debug symbols for libamdhip64 are not available, but the problem seems to be lower in the stack anyway. As a side note, there are a number of environment variables documented in the HIP Debugging Guide [1] that can be useful for debugging failures.
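
For illustration, here is a minimal Python sketch that prepares an environment with a few of those variables set. AMD_LOG_LEVEL, AMD_SERIALIZE_KERNEL and AMD_SERIALIZE_COPY are the names as I understand them from the guide; the values shown and the helper name hip_debug_env are assumptions, so check them against [1] for the ROCm version in use.

```python
import os


def hip_debug_env(log_level=3, serialize=True, base=None):
    """Return a copy of the environment with HIP debugging variables set.

    log_level: value for AMD_LOG_LEVEL (higher is assumed to be more verbose).
    serialize: if true, ask the runtime to wait around kernel launches and
               copies, which can help narrow down which operation hangs.
    base:      environment to copy; defaults to os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["AMD_LOG_LEVEL"] = str(log_level)
    if serialize:
        env["AMD_SERIALIZE_KERNEL"] = "3"
        env["AMD_SERIALIZE_COPY"] = "3"
    return env


# Example (hypothetical path to the test binary):
# import subprocess
# subprocess.run(["./test_thrust_set_difference"], env=hip_debug_env())
```

The caller's own environment is left untouched, so the same test can be rerun with and without the extra logging.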

I think this is a particularly interesting case because we need to
figure out how to deal with such a case in our infra.

Some suites need hours to run, but if we set the general timeout to
three hours or so, that test above will just block three hours. One way
to solve this is by extending autopkgtest to honor a package-specific
timeout hint in debian/control (X-Autopkgtest-Timeout: 30m or similar).
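
To make the idea concrete, here is a sketch of how such a hint might be interpreted and enforced. It is not an existing autopkgtest feature: the "30m" syntax is taken from the example above, and the function names parse_timeout and run_with_timeout are hypothetical.

```python
import re
import subprocess

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(hint):
    """Convert a hint such as '45s', '30m' or '3h' into seconds.

    A bare number is treated as seconds (an assumption of this sketch).
    """
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", hint)
    if match is None:
        raise ValueError(f"unrecognised timeout hint: {hint!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit or "s"]


def run_with_timeout(argv, hint):
    """Run argv and return its exit code, or None if the timeout expired.

    subprocess.run kills the child when the timeout expires, so a test
    spinning in user space does not block the rest of the run.
    """
    try:
        return subprocess.run(argv, timeout=parse_timeout(hint)).returncode
    except subprocess.TimeoutExpired:
        return None
```

With a per-package hint, a suite that legitimately needs hours could keep a long limit while a hanging test like the one above is cut off after minutes.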

I've been talking to the recently formed Ubuntu HPC Team [2] on their Matrix channel and they seem interested in testing ROCm packages for Ubuntu. I don't know enough about the Debian test infrastructure to be much of a go-between on this, but Jason Nucciarone is the primary developer that expressed interest. It might be worth reaching out.


Sincerely,
Cory Bloor

[1]: https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html

[2]: https://discourse.ubuntu.com/t/high-performance-computing-team/35988
