gpuenv-utils with multiple podman workers
Hi Christian,
I have acquired a few more RDNA1 GPUs to ensure that we have at least
two workers for each of gfx1010, gfx1011, and gfx1012 on
ci.rocm.debian.net. To achieve this, I'm trying to increase the number
of podman workers to two or three per host, but I'm running into a problem.
The pretest acquires all GPUs on the system and attempts to lock them
with gpuenv-utils. The first worker locks both GPUs, and the second
worker then times out when it tries to do the same. This locking is done
by /usr/share/debci/util/pre-test. I could remove the call to pre-test,
but the health check is very useful for preventing a broken worker node
from consuming the entire job queue and reporting every job as failed.
To control access to the GPUs, I've set environment variables in the
autopkgtest arguments for each worker. For the first worker, I use
--env=ROCR_VISIBLE_DEVICES=0 and for the second worker I use
--env=ROCR_VISIBLE_DEVICES=1. I suppose I would also need to communicate
this restriction to the pretest somehow.
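As a rough illustration of what communicating the restriction could look
like, here is a minimal Python sketch that narrows a list of detected GPU
indices to those allowed by ROCR_VISIBLE_DEVICES, so that a pretest would
only try to lock its own devices. The function name visible_gpu_indices is
hypothetical and is not part of gpuenv-utils or debci. The sketch also
assumes the variable holds comma-separated integer indices (not UUIDs) and
that the pretest can see the same environment as the test.

```python
import os


def visible_gpu_indices(all_indices, env=None):
    """Return the GPU indices this worker may use (hypothetical helper).

    all_indices: indices of every GPU detected on the host, e.g. [0, 1].
    env: mapping to read ROCR_VISIBLE_DEVICES from (defaults to os.environ).
    """
    env = os.environ if env is None else env
    value = env.get("ROCR_VISIBLE_DEVICES")
    if value is None:
        # No restriction set: every detected GPU is a candidate.
        return list(all_indices)

    allowed = []
    for token in value.split(","):
        token = token.strip()
        # Assumption: only plain integer indices are handled here.
        if not token.isdigit():
            continue
        index = int(token)
        if index in all_indices and index not in allowed:
            allowed.append(index)
    return allowed
```

With the settings above, the first worker would get [0] and the second
worker [1] on a two-GPU host, so each would attempt to lock only one GPU.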
Sincerely,
Cory Bloor