
Re: Testing ROCm on Everything



Hi Cory,

On 2023-05-13 02:42, Cordell Bloor wrote:
> I saw a good deal on a Radeon VII, so I now have a spare gfx906 GPU to
> donate to this project. I've also purchased a few MI25 GPUs, though I
> still need to validate them. However, I also don't have a server to put
> them in, so I will have to 3D print a fan adapter first.

Server space is indeed a major issue, for two separate reasons I think:
  (1) We need the right hardware
  (2) We need a place where this hardware can run

Ad (1)
======
In terms of cost efficiency, I think one either has to go for the
cheapest possible single-GPU solution (cheapest mainboard, CPU, etc.) or
the cheapest multiple-GPU solution (mainboard with multiple x16 slots,
a somewhat beefy CPU, etc.).

The multiple-GPU solution (which I've chosen for my own needs, and which
is another reason why I'm so motivated to get QEMU pass-through working)
is complicated by the fact that, due to their fan configuration, most
consumer cards cannot be adequately cooled when co-installed on one board.
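
For reference, the pass-through setup I have in mind is the usual VFIO
route. A minimal sketch, assuming a placeholder PCI address of
0000:0b:00.0 for the guest GPU and a bare-bones QEMU invocation (the
disk image name and sizing are placeholders, too):

  # rebind the card from amdgpu to vfio-pci (address is a placeholder)
  echo 0000:0b:00.0 > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
  echo vfio-pci     > /sys/bus/pci/devices/0000:0b:00.0/driver_override
  echo 0000:0b:00.0 > /sys/bus/pci/drivers_probe

  # hand the device to a KVM guest
  qemu-system-x86_64 -enable-kvm -machine q35 -cpu host -m 16G \
      -device vfio-pci,host=0b:00.0 \
      -drive file=rocm-test.qcow2,if=virtio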

Both solutions have their pros and cons. I'll put up a few examples,
including my own multi-GPU setup, on the wiki, for further discussion.

Ad (2)
======
We will eventually need to find someone who can run the hardware for us,
which, taken together, will have significant power and cooling requirements.

I'm confident that Debian can eventually find a partner or sponsor
willing to do that for us, but we'll first need to validate our solution
from (1).

Until that happens, I fear we'll be constrained to whatever basements
and living rooms contributors can offer. Such is the nature of
bootstrapping.

I'll also put up a brainstorming scratch pad to that end on the wiki.

> If I were to acquire additional hardware for Debian, would the
> preference be to have server GPUs like the MI25 or desktop/workstation
> GPUs like the Radeon VII? They're designed for installation into very
> different systems, so it would be good to ensure I'm finding the right
> sort of hardware.

Given the current AI trend, my gut says we should focus on whatever
consumer [1] cards users are most likely experimenting with right now.
My guess is that means any RDNA2 or RDNA3 card with at least 8GB of
memory, though 16GB seems more probable.

I have no idea how CDNA users install ROCm, hence I have no idea which
cards would be best to support. One would hope that even MI100, MI200,
MI250 and eventually even MI300 users would enjoy having a distribution
where the software stack is only an apt-get away, but I'm going to
assume that large-scale deployments have very custom setups, probably
with direct support from AMD.
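
To make the "apt-get away" point concrete, the end state I'd like to
see is roughly the following (package names are illustrative; the exact
set available in the archive is still growing):

  apt-get install hipcc rocminfo
  rocminfo | grep gfx   # confirm the runtime can see the GPU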

You mentioned earlier in the thread:

On 2023-04-08 09:42, Cordell Bloor wrote:
> I wouldn't worry about that too much. If you can get the software in place for testing, I can figure out how to provide you with the necessary hardware resources. 

I'm still working on that; I'm running some final validations on my ROCm
box, which needed more memory (received today).

I really want to get our autopkgtest infra up and running by May, and
I'm confident that this is doable.
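
To give an idea of what that could look like on the package side, here
is a rough sketch of a debian/tests/control entry; the test name and
the exact restrictions we'll need (in particular, how to mark tests
that require a physical GPU) are still open questions:

  Tests: run-rocm-smoke-test
  Depends: @, rocminfo
  Restrictions: isolation-machine, allow-stderr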

Best,
Christian

[1] See Petter's post [2] for just how much impact regular users are
having on this trend.

[2] https://lists.debian.org/debian-ai/2023/05/msg00019.html

