
Re: A deep learning rig with 8 GPUs



The same applies to the NVIDIA platform. I'm working with a bunch
of 8-GPU servers (4U size), and management is not fun at all if we
use any configuration that is not battle-tested. Even with
server-grade solutions, we still have to reboot due to various
kinds of problems, like driver bugs and hardware issues.
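
For what it's worth, the kind of check we end up relying on is
roughly the following (a minimal sketch in Python; the nvidia-smi
query, the 30-second timeout, and the "just print a warning" action
are illustrative assumptions, not our exact setup):

    #!/usr/bin/env python3
    """Sketch: flag a node whose GPUs probably need a power cycle.

    Assumption: nvidia-smi is on PATH, and a query that hangs or
    errors out is treated as "GPU in a bad state".
    """
    import subprocess

    QUERY = ["nvidia-smi",
             "--query-gpu=index,name,temperature.gpu",
             "--format=csv,noheader"]

    def gpus_look_healthy(timeout_s: int = 30) -> bool:
        """Return False if nvidia-smi hangs or reports an error.

        A hung or failing query is a common symptom of a GPU that
        has fallen off the bus and usually cannot be recovered
        without a reboot or power cycle.
        """
        try:
            result = subprocess.run(QUERY, capture_output=True,
                                    text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # query wedged: driver is likely stuck
        return result.returncode == 0

    if __name__ == "__main__":
        if not gpus_look_healthy():
            # In practice this would page an operator or schedule
            # a power cycle via the BMC; printing keeps it simple.
            print("GPU query failed or timed out; node needs attention")
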

On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
> Hi Christian,
> 
> On 2023-08-16 05:17, Christian Kastner wrote:
> > Ha, this just popped up on HN again. I saw this last year but
> > forgot to bookmark it, and have been looking for it since:
> > 
> > https://nonint.com/2022/05/30/my-deep-learning-rig/
> > 
> > The impressive feat here is driving 8 consumer-grade GPUs (each
> > with 350W draw) off of a single mainboard, and with two
> > independent power supplies. This requires all sorts of trickery.
> > Just figuring out the cooling alone is a major feat. This is
> > amazing stuff.
> > 
> > Two of those rigs could probably cover all the AMD GPUs we want
> > to test.. though most hosts will require something rackable, I
> > fear.
> 
> In practice, I think the logistics will be significantly more
> difficult than that. You can certainly stuff a bunch of AMD GPUs
> into a box, but even with PCIe pass-through to isolate the GPUs,
> you may find that sometimes the only reliable way to restore the
> GPU to a known-good state is to power-cycle the system. Not all
> hardware is as well-behaved as Navi 21.
> 
> Sincerely,
> Cory Bloor
> 
