
Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests



Hi all,

On 2023-07-11 23:04, Étienne Mollier wrote:
>> Au contraire, that is a great success. It is my understanding that it should
>> not be possible for a normal program to cause a GPU reset. This is therefore
>> not a bug in rocthrust, but rather an indication of a problem in some other
>> component of the test system. It could be a hardware problem or a software
>> problem. One possibility would be a bug in the amdgpu driver.
>>
>> This is exactly the sort of thing that the autopkgtests exist to catch. I'm
>> hoping that once we get this CI system enabled, we will be able to file some
>> high-quality bug reports against the Linux kernel.

I'm really behind on this, but it's turtles upon turtles [1]. I think
it's time for me to decompose the problem into smaller ones and share
results per sub-problem; perhaps I'm overthinking some of them.

But yes, I fully agree - bugs like that are something that our infra
should be able to catch automatically.

> Thanks for checking!  So far, I think I isolated
> test_thrust_set_difference, as it very much stresses the gpu,
> but I haven't seen it finish in autopkgtest context yet.
> 
> Now I'm a bit bugged, because the build tests all passed on my
> end before I ran the autopkgtest (and timing information
> suggests all SetDifference related tests lasted only a couple
> of seconds), but the autopkgtest proper collided on the Xorg
> server (at least once but I haven't retried such configuration
> yet), or ran for dozens of minutes without giving an impression
> of moving forward.  I don't exclude the possibility that an
> implementation detail of the autopkgtest is interfering with
> the run for that very test, but I'm not sure what it could be
> yet.  Or there is something else I'm completely missing.

I can confirm that test_thrust_set_difference also hangs on my end with
an RX 6800 XT, not just when driven by autopkgtest with the QEMU backend
but also on bare metal.

The test hangs with one CPU core at 100% load. Here is the backtrace:

> #0  rocr::core::InterruptSignal::WaitRelaxed (this=0x55b5c6c88800, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, 
>     timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_ACTIVE) at ./src/core/runtime/interrupt_signal.cpp:198
> #1  0x00007f0c6ca577ea in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, 
>     compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>)
>     at ./src/core/runtime/interrupt_signal.cpp:220
> #2  0x00007f0c6ca4d5a7 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, 
>     compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_ACTIVE)
>     at ./src/core/runtime/hsa.cpp:1219
> #3  0x00007f0c6d56d8bb in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #4  0x00007f0c6d57283e in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #5  0x00007f0c6d5a755b in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #6  0x00007f0c6d56fc3f in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #7  0x00007f0c6d53b4c6 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #8  0x00007f0c6d3beba3 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #9  0x00007f0c6d3c1fd8 in hipMemcpyWithStream () from /lib/x86_64-linux-gnu/libamdhip64.so.5
> #10 0x000055b5c3f6875d in ?? ()
> #11 0x000055b5c3f3211f in ?? ()
> #12 0x000055b5c3f2aed0 in ?? ()
> #13 0x000055b5c3f5d5ff in ?? ()
> #14 0x000055b5c3f6b615 in ?? ()
> #15 0x000055b5c3f6c6cb in ?? ()
> #16 0x000055b5c3f72079 in ?? ()
> #17 0x000055b5c3f6cd0c in ?? ()
> #18 0x00007f0c6d04618a in __libc_start_call_main (main=main@entry=0x55b5c3f6c800, argc=argc@entry=1, 
>     argv=argv@entry=0x7fff74c21ef8) at ../sysdeps/nptl/libc_start_call_main.h:58
> #19 0x00007f0c6d046245 in __libc_start_main_impl (main=0x55b5c3f6c800, argc=1, argv=0x7fff74c21ef8, init=<optimized out>, 
>     fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff74c21ee8) at ../csu/libc-start.c:381
> #20 0x000055b5c3f29921 in ?? ()

I think this is a particularly interesting case because we need to
figure out how our infra should deal with hangs like this.

Some suites need hours to run, but if we set the general timeout to
three hours or so, a hung test like the one above will simply block for
three hours. One way to solve this is to extend autopkgtest to honor a
package-specific timeout hint in debian/control
(X-Autopkgtest-Timeout: 30m or similar).
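
As a rough sketch of how such a hint could be applied (this is not
existing autopkgtest behavior; the field is only the proposal above, and
parse_timeout and run_with_timeout are hypothetical helper names), a
runner might turn the hint into seconds and kill the test once the
limit is exceeded:

```python
import re
import subprocess
import sys

# Seconds per unit suffix; a bare number is treated as seconds (an assumption).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(hint: str) -> int:
    """Convert a hint such as '30m', '45s' or '3h' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", hint)
    if match is None:
        raise ValueError(f"unrecognised timeout hint: {hint!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit or "s"]


def run_with_timeout(argv, hint=None, default="3h"):
    """Run argv; return its exit status, or None if it hit the timeout.

    The package-specific hint wins over the general default when present.
    """
    seconds = parse_timeout(hint or default)
    try:
        # subprocess.run kills the child when the timeout expires and then
        # raises TimeoutExpired.
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for a hung test: sleeps far longer than the 1s hint allows.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, hint="1s"))
```

Note that this only kills the direct child process; a real runner would
probably need to terminate the whole process group (or the testbed) to
clean up after a test that is stuck waiting on the GPU.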

Best,
Christian

[1] For example, our infra will need modified versions of official
packages for some time -- until we are ready to upstream our changes.
Which means we need an APT server. Which means we need an ansible config
for a reprepro + apache2 setup, and a way to securely sign Release files
(I'm using a Nitrokey), and a way to manage upload permissions via ansible,
and so on. Well, at least everyone can have their ppa now, if they want one.

