Re: A deep learning rig with 8 GPUs
The same applies to the NVIDIA platform. I'm working with a bunch
of 8-GPU servers (4U size), and management is no fun at all with
any configuration that isn't battle-tested. Even with server-grade
hardware, we still have to reboot for various kinds of problems,
like driver bugs and hardware issues.
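For what it's worth, before falling back to a full reboot we sometimes try a soft reset at the PCIe level via sysfs. A minimal sketch below; the BDF address is hypothetical (find yours with `lspci`), and it defaults to a dry run since the real writes need root and will kill anything using the device:

```shell
#!/bin/sh
# Sketch: soft-reset a wedged GPU via PCIe remove/rescan instead of rebooting.
# BDF is a placeholder -- look up the real one with `lspci | grep -i vga`.
BDF="0000:03:00.0"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually write to sysfs (needs root)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Detach the device from its driver and drop it from the PCI tree...
run sh -c "echo 1 > /sys/bus/pci/devices/$BDF/remove"
# ...then rescan the bus so the kernel re-enumerates and re-probes it.
run sh -c "echo 1 > /sys/bus/pci/rescan"
```

This only helps when the device itself comes back sane after re-probe; if the card is truly hung, a power cycle is still the only option.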
On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
> Hi Christian,
>
> On 2023-08-16 05:17, Christian Kastner wrote:
> > Ha, this just popped up on HN again. I saw this last year but
> > forgot to
> > bookmark it, and have been looking for it since:
> >
> > https://nonint.com/2022/05/30/my-deep-learning-rig/
> >
> > The impressive feat here is driving 8 consumer-grade GPUs (each
> > with
> > 350W draw) off of a single mainboard, and with two independent
> > power
> > supplies. This requires all sorts of trickery. Just figuring out
> > the
> > cooling alone is a major feat. This is amazing stuff.
> >
> > Two of those rigs could probably cover all the AMD GPUs we want to
> > test... though most hosts will require something rackable, I fear.
>
> In practice, I think the logistics will be significantly more
> difficult
> than that. You can certainly stuff a bunch of AMD GPUs into a box,
> but
> even with PCIe pass-through to isolate the GPUs, you may find that
> sometimes the only reliable way to restore the GPU to a known-good
> state
> is to power-cycle the system. Not all hardware is as well-behaved as
> Navi 21.
>
> Sincerely,
> Cory Bloor
>