
Re: gpuenv-utils with multiple podman workers



Hi Cory,

On 2025-07-21 12:30, Cordell Bloor wrote:
> The pretest acquires all GPUs on the system and attempts to lock them
> with gpuenv-utils.

Side note: this is because the podman debci workers, unlike the QEMU
workers, don't support --gpu arguments and thus cannot deduce which
devices to lock. Consequently, they default to "all".

> The first worker locks both GPUs, and the second worker then times
> out when it tries to do the same. This locking is done by
> /usr/share/debci/util/pre-test. I could remove the call to pre-test,
> but the health check is very useful for preventing a broken worker
> node from consuming the entire job queue and reporting every job as
> failed.

You can override the hook in the debci worker's config by setting
debci_pretest_hook="<path>" to a worker-custom script.
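
A minimal sketch of what that could look like, assuming the worker's
debci config is a shell fragment (for example a file under
/etc/debci/conf.d/). The script path below is made up for illustration;
only the variable name comes from debci:

```shell
# Worker-local debci config fragment (assumed location: /etc/debci/conf.d/).
# The script path is a hypothetical example; point it at your own script.
debci_pretest_hook="/usr/local/share/debci/pre-test-worker0"
```

Each worker would point at its own script, so that each one only tries
to lock the devices it actually uses.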

For your custom script, you could copy the default pre-test [1] starting
from line 16, and replace the gpus="<...>" assignment with a
comma-separated list of the slots that your worker uses (see [2] for
gpuenv-utils examples). You'll have to figure out on your own how the
slots map to ROCR_VISIBLE_DEVICES=<index>.
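
To illustrate the mapping step only, here is a rough sketch of how a
custom script might derive the gpus value from the worker's
ROCR_VISIBLE_DEVICES setting. Everything in it is an assumption: the
helper names, the WORKER_ROCR_DEVICES variable, and especially the slot
names "slot-a" and "slot-b", which are placeholders for whatever
identifiers your hosts really use. The rest of the copied pre-test is
not shown:

```shell
#!/bin/sh
# Hypothetical helpers for a worker-custom pre-test script.

# Map one ROCR_VISIBLE_DEVICES index to a slot name.
# ASSUMPTION: "slot-a"/"slot-b" are placeholders; fill in the real
# slot identifiers after checking the mapping on your own host.
slot_for_index() {
    case "$1" in
        0) echo "slot-a" ;;
        1) echo "slot-b" ;;
        *) echo "unknown ROCR index: $1" >&2; return 1 ;;
    esac
}

# Turn a comma-separated index list (e.g. "0,1") into the
# comma-separated slot list expected by the gpus= assignment.
slots_for_devices() {
    out=""
    old_ifs=$IFS
    IFS=,
    for idx in $1; do
        slot=$(slot_for_index "$idx") || { IFS=$old_ifs; return 1; }
        out="${out:+$out,}$slot"
    done
    IFS=$old_ifs
    echo "$out"
}

# WORKER_ROCR_DEVICES is a made-up variable; set it to the same value
# the worker passes via --env=ROCR_VISIBLE_DEVICES=<index>.
gpus=$(slots_for_devices "${WORKER_ROCR_DEVICES:-0}")
```

Keeping the table in one place means the autopkgtest arguments and the
pre-test lock list cannot silently drift apart.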

> To control access to the GPUs, I've set environment variables in the
> autopkgtest arguments for each worker. For the first worker, I use
> --env=ROCR_VISIBLE_DEVICES=0 and for the second worker I use
> --env=ROCR_VISIBLE_DEVICES=1. I suppose I would also need to
> communicate this restriction to the pretest somehow.

Theoretically it should be easy to add --gpu support to the podman+rocm
backend even with envvars, but this would present a layer violation from
autopkgtest's design POV (envvars are in the driver, not the backend).

I'm sure this is still doable but I'm prioritizing other stuff, so in
case anyone else wants to take a stab, go ahead.

Somewhere further down the road: from discussions at DebConf, it became
clear that to improve debci worker scheduling in general -- that is,
beyond our ROCm CI -- I will need to extend this utility to also support
CPU cores and memory (possibly through cgroups?), and also accelerators
from other manufacturers.

Best,
Christian

[1]: https://salsa.debian.org/rocm-team/debci/-/blob/rocm-fork/util/pre-test?ref_type=heads#L16
[2]: https://salsa.debian.org/rocm-team/gpuenv-utils#examples

