
gpuenv-utils with multiple podman workers



Hi Christian,

I have acquired a few more RDNA1 GPUs to ensure that we have at least two workers for each of gfx1010, gfx1011, and gfx1012 on ci.rocm.debian.net. To achieve this, I'm trying to increase the number of podman workers to two or three per host, but I'm running into a problem.

The pretest acquires all GPUs on the system and attempts to lock them with gpuenv-utils. The first worker locks both GPUs, and the second worker then times out when it tries to do the same. This locking is done by /usr/share/debci/util/pre-test. I could remove the call to pre-test, but the health check is very useful for preventing a broken worker node from consuming the entire job queue and reporting every job as failed.

To control access to the GPUs, I've set environment variables in the autopkgtest arguments for each worker. For the first worker, I use --env=ROCR_VISIBLE_DEVICES=0 and for the second worker I use --env=ROCR_VISIBLE_DEVICES=1. I suppose I would also need to communicate this restriction to the pretest somehow.
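
To illustrate the kind of behaviour I have in mind, here is a rough Python sketch. It is not how gpuenv-utils or pre-test work today; the function names, the per-GPU lock files, and the lock directory are all hypothetical, and it assumes ROCR_VISIBLE_DEVICES holds a comma-separated list of integer indices (UUID values are not handled). The idea is that the pretest locks only the devices its worker is allowed to see, instead of every GPU on the host:

```python
import fcntl
import os
import tempfile


def visible_devices(env=None, total=2):
    """Return the GPU indices this worker may use.

    Hypothetical helper: reads ROCR_VISIBLE_DEVICES as a comma-separated
    list of integer indices and falls back to all `total` devices when
    the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("ROCR_VISIBLE_DEVICES", "").strip()
    if not raw:
        return list(range(total))
    return [int(tok) for tok in raw.split(",") if tok.strip()]


def try_lock(index, lock_dir):
    """Try to take an exclusive, non-blocking lock for one GPU index.

    Returns the open file object (keep it open to hold the lock), or
    None if another holder already has the lock. The lock file layout
    is an assumption made for this sketch.
    """
    path = os.path.join(lock_dir, f"gpu{index}.lock")
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # A temporary directory stands in for wherever real locks would live.
    lock_dir = tempfile.mkdtemp()
    held = [try_lock(i, lock_dir) for i in visible_devices()]
    print("all visible GPUs locked:", all(h is not None for h in held))
```

With something along these lines, the worker started with --env=ROCR_VISIBLE_DEVICES=0 would only contend for GPU 0 and the worker started with --env=ROCR_VISIBLE_DEVICES=1 would only contend for GPU 1, so neither would time out waiting on the other.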

Sincerely,
Cory Bloor

