
Re: RFS: rocthrust/5.3.3-4~exp1 -- ROCm parallel algorithms library - tests



On 2023-07-12 22:53, Cordell Bloor wrote:
> Hi Étienne and Christian,
> 
> I'm a bit unclear on our policy with these sorts of issues. Should this
> be blocking the entry of rocthrust-tests into experimental?

No, experimental is fine, or even unstable.

Failing _autopkgtests_ are blockers for the (automatic) migration from
unstable to testing.

Maintainers can elect to skip failing tests, to allow migration to
testing, because one failing test can block the entire source package
from migrating, and that's annoying if the test only fails in a very
rare way, or on some niche architecture, e.g. [3].

> The tests are failing (and in a bad way), but they're just the
> messenger. They exercised the system and revealed a preexisting
> problem. To me, that seems valuable to end-users. It's a lot faster
> to discover that some rocthrust function will cause problems on your
> GPU by installing and running the test suite vs. being half-way
> through writing your own program using librocthrust-dev and
> discovering it via your own code.

I emphasized _autopkgtests_ above because you are of course right: these
particular tests (being hardware-dependent) have substantial value to
end users, so it would make sense to allow them to migrate.

So to get around the autopkgtest migration condition, what I would do
eventually in debian/tests/upstream-binaries is to skip tests if a GPU
is detected for which we can expect failure.
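
To make that idea concrete, here is a minimal sketch of the skip logic.
It is not the actual contents of debian/tests/upstream-binaries. It
assumes the GPU architecture can be read from text such as the output of
rocminfo, that the test declares "Restrictions: skippable" in
debian/tests/control (so that exit status 77 is reported as a skip), and
it uses a made-up list of architectures; all function and constant names
are hypothetical.

```python
import re

# Hypothetical: architectures on which we currently expect failures.
# These entries are placeholders, not a statement about real hardware.
EXPECTED_FAILURES = {"gfx1010", "gfx1031"}

# autopkgtest reports exit status 77 as "skipped" for tests that
# declare "Restrictions: skippable" in debian/tests/control.
SKIP_EXIT_CODE = 77


def detected_architectures(rocminfo_output):
    """Return the set of gfx architecture names found in the given text.

    Assumption: the text (e.g. rocminfo output) mentions names such as
    gfx90a or gfx1030 somewhere.
    """
    return set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output))


def should_skip(rocminfo_output, expected_failures=EXPECTED_FAILURES):
    """True if any detected GPU is one we expect the tests to fail on."""
    return bool(detected_architectures(rocminfo_output) & set(expected_failures))


def exit_status(rocminfo_output):
    """Exit status a wrapper would return before running the real tests."""
    if should_skip(rocminfo_output):
        print("SKIP: GPU with expected failures detected")
        return SKIP_EXIT_CODE
    return 0
```

A wrapper would capture the rocminfo output, pass it to exit_status(),
and only run the upstream test binaries if the result is 0. Users who
run the test suite by hand still see the real failures on their
hardware; only the migration check is bypassed.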

> Their consensus was that the two most important things are to provide a
> reproducer (i.e., a clearly described method to reproduce the problem)
> and the full dmesg log (as it contains lots of info like kernel version,
> VBIOS version, etc).

That makes sense, and should be doable. I guess one would submit this to
the amd-gfx list [4], right? Or do they have their own tracker
somewhere? (The list already helped me with [5])

I guess a hardware person would probably balk at this, but if a failure
on bare metal can also be reproduced in a QEMU VM, and a VM is
acceptable to them, an image is quickly shared, too.

> It's a shame that the debug symbols for libamdhip64 are not available,
> but the problem seems to be lower in the stack anyway. As a side note,
> there are a number of environment variables documented in the HIP
> Debugging Guide [1] that can be useful for debugging failures.

Thanks for the pointer!

> I've been talking to the recently formed Ubuntu HPC Team [2] on their
> matrix channel and they seem interested in testing ROCm packages for
> Ubuntu. I don't know enough about the Debian test infrastructure to be
> much of a go-between on this, but Jason Nucciarone is the primary
> developer that expressed interest. It might be worth reaching out.

Definitely, though we need to have our own infra up first, as they are
also constrained by at least problems (2) and (3) of [6]. Once we have a
viable solution, we can share.

Best,
Christian

> [1]: https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html
> [2]: https://discourse.ubuntu.com/t/high-performance-computing-team/35988

[3] https://lists.debian.org/debian-arm/2020/02/msg00076.html

[4] https://lists.freedesktop.org/archives/amd-gfx/

[5] https://lists.freedesktop.org/archives/amd-gfx/2023-June/094697.html

[6] https://lists.debian.org/debian-ai/2023/03/msg00038.html

