
Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests



Hi Cory,

Cordell Bloor, on 2023-07-10:
> On 2023-07-09 14:43, Étienne Mollier wrote:
> > Cordell Bloor, on 2023-07-07:
> > > I've added a librocthrust-tests package. This is quite similar to
> > > librocprim-tests.
> > Hmn, I have no luck with this one.  The package built fine,
> > including the build time checks, given that I exposed the gpu.
> > But when I ran the autopkgtest suite, one of the tests caused a
> > gpu reset.
> 
> Au contraire, that is a great success. It is my understanding that it should
> not be possible for a normal program to cause a GPU reset. This is therefore
> not a bug in rocthrust, but rather an indication of a problem in some other
> component of the test system. It could be a hardware problem or a software
> problem. One possibility would be a bug in the amdgpu driver.
> 
> This is exactly the sort of thing that the autopkgtests exist to catch. I'm
> hoping that once we get this CI system enabled, we will be able to file some
> high-quality bug reports against the Linux kernel.

Good point, I'm wrapping up a bug report against the distribution
kernel.  It's not sent yet, as I'd like to run a few more tests
to complete my report.

> >    Excerpt from dmesg:
> > 
> > 	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=8914355, emitted seq=8914357
> > 	[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 625332 thread Xorg:cs0 pid 625333
> > 
> > I guess I probably should retry outside graphical context, to
> > avoid interfering with the test suite.  It might be helpful to
> > double check how things go on another card than RX 6800, or
> > another similar model.  Could someone check?
> 
> I haven't reproduced your exact setup, but the build tests all passed on my
> Radeon VII.

Thanks for checking!  So far, I think I have isolated
test_thrust_set_difference as the culprit: it stresses the gpu
heavily, and I haven't seen it finish in autopkgtest context yet.

Now I'm a bit puzzled, because the build tests all passed on my
end before I ran the autopkgtest (and timing information
suggests all SetDifference related tests lasted only a couple
of seconds), but the autopkgtest proper either collided with the
Xorg server (at least once; I haven't retried that configuration
yet), or ran for dozens of minutes without giving any impression
of moving forward.  I don't exclude the possibility that an
implementation detail of the autopkgtest is interfering with
the run of that very test, but I'm not sure what it could be
yet.  Or there is something else I'm completely missing.
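
To tell the two failure modes apart in later runs, something
like the following could pick the ring timeouts out of the
kernel log.  It is only a minimal sketch: the function name and
the regular expressions are my own, modelled on the two dmesg
lines quoted above, and other kernel versions may well word
these messages differently.

```python
import re

# Patterns modelled on the two dmesg lines quoted in this thread.
# Assumption: the message wording is stable across kernel versions.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_INFO = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def find_ring_timeouts(log_text):
    """Return one dict per amdgpu ring timeout found in log_text."""
    events = []
    for line in log_text.splitlines():
        if "amdgpu_job_timedout" not in line:
            continue
        match = RING_TIMEOUT.search(line)
        if match:
            events.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
                "process": None,
                "pid": None,
            })
            continue
        # The process line follows the timeout line it belongs to.
        match = PROCESS_INFO.search(line)
        if match and events:
            events[-1]["process"] = match.group("process")
            events[-1]["pid"] = int(match.group("pid"))
    return events
```

Fed with the output of dmesg after a run, each entry names the
ring that hung and the process the kernel blamed for it, which
in the excerpt above was Xorg rather than the test binary.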

Have a nice day,  :)
-- 
  .''`.  Étienne Mollier <emollier@debian.org>
 : :' :  gpg: 8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
 `. `'   sent from /dev/tty1, please excuse my verbosity
   `-
