
Re: A deep learning rig with 8 GPUs



The same applies to the NVIDIA platform. I'm working with a bunch
of 8-GPU servers (4U size), and management is not fun at all if we
use any configuration that is not battle-tested. Even with
server-grade solutions, we still have to reboot due to various
kinds of problems, like driver bugs and hardware issues.
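
For what it's worth, the kind of check we end up relying on is
roughly the following (a minimal sketch in Python; the nvidia-smi
query, the 30-second timeout, and the "just print a warning" action
are illustrative assumptions, not our exact setup):

    #!/usr/bin/env python3
    """Sketch: flag a node whose GPUs probably need a power cycle.

    Assumption: nvidia-smi is on PATH, and a query that hangs or
    errors out is treated as "GPU in a bad state".
    """
    import subprocess

    QUERY = ["nvidia-smi",
             "--query-gpu=index,name,temperature.gpu",
             "--format=csv,noheader"]

    def gpus_look_healthy(timeout_s: int = 30) -> bool:
        """Return False if nvidia-smi hangs or reports an error.

        A hung or failing query is a common symptom of a GPU that
        has fallen off the bus and usually cannot be recovered
        without a reboot or power cycle.
        """
        try:
            result = subprocess.run(QUERY, capture_output=True,
                                    text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # query wedged: driver is likely stuck
        return result.returncode == 0

    if __name__ == "__main__":
        if not gpus_look_healthy():
            # In practice this would page an operator or schedule
            # a power cycle via the BMC; printing keeps it simple.
            print("GPU query failed or timed out; node needs attention")
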

On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
> Hi Christian,
> 
> On 2023-08-16 05:17, Christian Kastner wrote:
> > Ha, this just popped up on HN again. I saw this last year but
> > forgot to bookmark it, and have been looking for it since:
> > 
> > https://nonint.com/2022/05/30/my-deep-learning-rig/
> > 
> > The impressive feat here is driving 8 consumer-grade GPUs (each
> > with 350W draw) off of a single mainboard, and with two
> > independent power supplies. This requires all sorts of trickery.
> > Just figuring out the cooling alone is a major feat. This is
> > amazing stuff.
> > 
> > Two of those rigs could probably cover all the AMD GPUs we want
> > to test.. though most hosts will require something rackable, I
> > fear.
> 
> In practice, I think the logistics will be significantly more
> difficult than that. You can certainly stuff a bunch of AMD GPUs
> into a box, but even with PCIe pass-through to isolate the GPUs,
> you may find that sometimes the only reliable way to restore the
> GPU to a known-good state is to power-cycle the system. Not all
> hardware is as well-behaved as Navi 21.
> 
> Sincerely,
> Cory Bloor
> 
