
RFC: Strategy for getting ROCm test coverage



Dear ROCm Team,

Currently, ROCm-related packages are tested only by the maintainer
preparing the package. Neither official build-time tests nor official
debci autopkgtests are run.

There are three main reasons for this, as far as I'm aware:

  (1) Most of our official infrastructure is not set up to allow for
      host-specific hardware. For example, buildds are frequently not
      physical hosts but VMs that need to be migrated across physical
      hosts.

  (2) There are also concerns regarding security and stability of
      drivers and hardware, in the sense of a potentially increased
      attack surface and/or more maintenance work.

  (3) Even if (1) and (2) were already solved, we still couldn't
      ensure tests are run on a capable worker because there is no way
      yet for packages to express a dependency on a GPU.

(1) and (2) are unfortunate, but not show-stoppers. A quick solution
would be to operate a dedicated debci infrastructure in which GPU
presence is guaranteed. Packages would continue to be built on the
official infra (without build-time tests), and the autopkgtests would be
run in this new debci environment.

=> This will not only help us detect bugs in our packages (e.g.
#1032677), but also breakages in dependencies, dependents, drivers, etc.

(3) is a more challenging problem. We must devise a way to express a GPU
dependency in debian/control and debian/tests/control. This may require
a new field, and a domain of values for that field (e.g. gfx1010,
gfx1030, etc.). However, this isn't urgent.
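
To make the idea in (3) a bit more concrete, here is a minimal Python
sketch of how a worker could match such a field against its own GPU.
Everything specific in it is an assumption for illustration: the field
name "X-GPU-Architectures", the test name "rocrand-smoke", and the helper
functions are hypothetical and not part of autopkgtest or any existing
policy.

```python
# Hypothetical sketch: "X-GPU-Architectures" is an assumed field name,
# not an existing debian/tests/control field.
FIELD = "x-gpu-architectures"


def parse_stanza(text):
    """Parse one deb822-style stanza into a dict with lowercased keys."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        if line[0] in " \t" and key is not None:
            # Continuation line: append to the previous field's value.
            fields[key] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            key = key.strip().lower()
            fields[key] = value.strip()
    return fields


def required_architectures(stanza_text):
    """Return the set of GPU architectures a test stanza asks for."""
    value = parse_stanza(stanza_text).get(FIELD, "")
    return set(value.replace(",", " ").split())


def worker_can_run(stanza_text, worker_arch):
    """A stanza without the field is assumed to run on any worker."""
    required = required_architectures(stanza_text)
    return not required or worker_arch in required


# Illustrative stanza; the test name and the last field are made up.
EXAMPLE = """\
Tests: rocrand-smoke
Depends: @
X-GPU-Architectures: gfx1010, gfx1030
"""
```

With this sketch, worker_can_run(EXAMPLE, "gfx1030") is true, while a
worker reporting an architecture outside the listed domain would skip
the test. Treating a missing field as "runs anywhere" is only one
possible default and would need to be decided as part of the design.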

As above, we'll simply assume GPU presence is guaranteed, and I'm sure
we'll figure out the rest as we go, once we get broader coverage both
package-wise and test-wise.
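
Until a proper dependency mechanism exists, a test could at least fail
fast when that assumption does not hold. The following is a rough sketch
only: it checks for the device nodes that the ROCm userspace normally
opens (/dev/kfd and a /dev/dri/renderD* node). The function name and the
"root" parameter (added so the check can be exercised against a fake
tree) are my own invention, and the presence of the nodes does not prove
that the GPU is actually usable.

```python
import glob
import os


def gpu_present(root="/"):
    """Return True if the KFD node and at least one DRM render node exist."""
    kfd = os.path.join(root, "dev", "kfd")
    render_nodes = glob.glob(os.path.join(root, "dev", "dri", "renderD*"))
    return os.path.exists(kfd) and bool(render_nodes)
```

A test wrapper could call gpu_present() first and report a clear error
instead of an obscure runtime failure when run on a worker without a GPU.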


Proposal #1: Set up debci infrastructure
===========

I'm currently working on this. I have a headless server with a 6800 XT
which I will dedicate to this.

I initially focused my efforts on the autopkgtest-virt-podman driver,
but I've run into an odd issue that's blocking it from being usable, and
I actually wanted to use the autopkgtest-virt-qemu driver anyway.

Once I've got things ironed out, I'll share ansible roles so that other
workers (with other GPUs) can easily be added to the pool.


Proposal #2: Set up a new project "rocm-support"
============

To serve as a common base for scripts and utilities needed across
packages, and to hold documentation. Similar to what "nvidia-support" does.


Proposal #3: Draft ROCm autopkgtest specification guidelines
===========

Within "rocm-support", document how we specify our tests.

This isn't as simple as it sounds. We need to settle on which GPU
architectures to build for, in which environments to run (root, user,
both), even settle basic things like how to name and structure packages,
where to install the files to, and so on.

In the medium term, I assume that a path towards getting our changes
into main autopkgtest will crystallize.


Feedback on these three proposals and/or suggestions for further
proposals would be greatly appreciated.

Best,
Christian

PS: I've created an "autopkgtest" branch in src:rocrand as an example
for one of the packages I've been working on:

https://salsa.debian.org/rocm-team/rocrand/-/tree/autopkgtest

