
RFC: Strategy for extending our CI to Ubuntu and other distributions



Hi all,

With the upload of debci_3.10+rocm4, gpuenv-utils_0.1.5, and some other
packages, I have finished my push towards getting other distributions,
most importantly Ubuntu, to be testable through our CI. Everything
should be in place now.

However, the deployment requires making some design decisions for which
I'd appreciate input. There is no rush, though.

There are two major decisions to make:
   I. (a) Single master for everything, or (b) multiple masters
  II. Worker instances with (a) exclusive or (b) cooperative GPU access


I.a Single master
=================

Having a single master would mean that all distributions would be
managed through ci.rocm.debian.net. As a demonstrator, I implemented
this in our dev environment [ci-dev.rocm.debian.net], where one can run
tests in unstable and noble.

Upside: there is just one master to manage.

Downside: there is no granular control over jobs in the AMQP queues. We
cannot put on hold or purge only Ubuntu jobs should the need arise.

Consequently, with podman-backed workers, there is no way to tell a
worker running on a host with a Debian kernel to skip an Ubuntu job and
leave it to a worker on a host running an Ubuntu kernel.

Downside: all workers connected to the master must be able to handle all
distributions known to the master. So adding a distribution, or
debugging issues with one, requires coordination across all our infra.
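
To make the queue-control downside more concrete: if jobs were routed
into per-distribution queues, we could act on one distribution's backlog
without touching the others. A minimal sketch in Python with pika (the
queue names are hypothetical, not debci's actual naming scheme):

    import pika

    # Hypothetical layout: one durable job queue per distribution/release.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    for queue in ("rocm-tests-debian-unstable", "rocm-tests-ubuntu-noble"):
        channel.queue_declare(queue=queue, durable=True)

    # Drop only the Ubuntu backlog; Debian jobs keep flowing.
    purged = channel.queue_purge(queue="rocm-tests-ubuntu-noble")
    print(f"purged {purged.method.message_count} queued Ubuntu jobs")

    connection.close()

With everything going through one master as in (a), that kind of
targeted purge is exactly what we cannot do.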


I.b Multiple masters
====================

We run multiple independent masters, e.g.:
         ci.rocm.debian.net
  ubuntu.ci.rocm.debian.net
    kali.ci.rocm.debian.net

Upside: this would give us somewhat granular control over jobs, and
infra changes could be done in smaller chunks.

Downside: a bit of cost (currently $60 p.a. per master, so nothing
tragic), and a bit of extra admin work.

We actually already use this pattern for our dev/test/prod environments
ci-dev/ci-test/ci.rocm.debian.net. The question is whether we want to
expand it.


II.a Workers w/ exclusive GPU access
====================================

With the standard debci execution driver, a host can have multiple GPUs
and multiple worker instances, but every GPU can be assigned to at most
one worker instance.

Upside: we have been running this for a year and it works very well.

Downside: each worker instance can only listen to exactly one queue on
one master. So in a multiple-master setting, we'd either need more GPUs
or (more reasonably) some new agent that rotates the active worker
instances that share the same GPU; a rough sketch of such an agent
follows below.

Downside: GPUs cannot be used for other purposes, like porterboxes.
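
For completeness, the agent mentioned above would not have to be
elaborate. Here is a rough illustration of the idea in Python (the
systemd unit names are made-up placeholders, not real debci units):
only one worker instance per GPU is active at a time, and the GPU is
periodically handed to the next master's instance.

    import subprocess
    import time

    # Hypothetical per-master worker units sharing gpu0.
    WORKER_UNITS = [
        "debci-worker-debian@gpu0.service",
        "debci-worker-ubuntu@gpu0.service",
    ]
    ROTATION_SECONDS = 30 * 60  # hand over the GPU every 30 minutes

    def switch_to(active: str) -> None:
        for unit in WORKER_UNITS:
            action = "start" if unit == active else "stop"
            subprocess.run(["systemctl", action, unit], check=True)

    while True:
        for unit in WORKER_UNITS:
            switch_to(unit)
            time.sleep(ROTATION_SECONDS)

A real agent would of course also have to wait for the active instance
to finish its current job before stopping it.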


II.b Workers w/ cooperative GPU access
======================================

In debci_3.10+rocm4, I introduced a new execution driver that
implements cooperative GPU access. Worker instances that need a GPU for
a job now simply wait for the GPU to become free.
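
Conceptually, this amounts to an advisory per-GPU lock that a worker
instance holds for the duration of a job. The sketch below only
illustrates that waiting behaviour (the lock file path and the test
command are hypothetical placeholders); it is not the actual driver
code:

    import fcntl
    import subprocess

    GPU_LOCK = "/run/lock/rocm-gpu0.lock"  # hypothetical per-GPU lock file

    def run_job_with_gpu(cmd: list[str]) -> int:
        with open(GPU_LOCK, "w") as lock:
            # Blocks here until no other worker instance holds the GPU.
            fcntl.flock(lock, fcntl.LOCK_EX)
            try:
                return subprocess.run(cmd).returncode
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)

    # Any number of worker instances on the host can call this; only one
    # of them is actually on the GPU at any given time.
    run_job_with_gpu(["./run-one-test.sh", "some-package"])  # hypothetical command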

Upside: on one host, one can start an arbitrary number N of worker
instances. In a multiple-master setting, one physical GPU could be used
by N worker instances connecting to M masters. This would maximize
utilization.

Upside: GPUs can also be used by porterbox VMs.

Downside: this hasn't been battle-hardened yet. I've done lots of
testing, but that is no substitute for real-world use.


My own thoughts
===============

For decision I, I'd lean towards (b) multiple masters, as the
everything-in-one-queue approach of (a) gives off a bit of a foot-gun
vibe to me, but maybe I'm just being overly cautious.

For decision II, I'd lean towards running the (b) cooperative execution
driver in ci-test with some real-world workloads, as I think cooperative
access will become inevitable at some point, specifically once we want
to provide porterbox environments. If it works out, we can phase it into
production.

Best,
Christian

