
Re: RFC: Strategy for getting ROCm test coverage



Hi Christian,

On 2023-03-12 08:38, Christian Kastner wrote:
   (2) There are also concerns regarding security and stability of
       drivers and hardware, in the sense of a potentially increased
       attack surface and/or more maintenance work.

A few cloud vendors provide access to servers with AMD GPUs. At the scale they operate, I would expect them to have those sorts of issues handled. Perhaps that is an option?

AWS EC2 has G4ad instances with Radeon Pro V520 GPUs [1]. That is Navi 12 hardware (gfx1011) [2]. There are plenty of options for single-GPU VMs and a few options for multi-GPU VMs. If we're able to use spot instances and only spin up VMs when we need them, it might end up being reasonably cost-effective.

Microsoft Azure has NVv4 instances with MI25 GPUs [3]. That is Vega 10 hardware (gfx900) [4]. Unfortunately, I see that these instances are Windows-only, so I don't think we will be able to use them.

Incidentally, the MI25 and MI60 GPUs have become quite affordable on the used market. The MI25 is practically being given away on eBay right now. I suspect that not many people have the forced-air server racks these cards require, but they would make a great testing platform for us if we had somewhere to put them.

Sincerely,
Cory Bloor

[1]: https://aws.amazon.com/ec2/instance-types/g4/
[2]: https://www.techpowerup.com/gpu-specs/radeon-pro-v520.c3755
[3]: https://learn.microsoft.com/en-us/azure/virtual-machines/nvv4-series
[4]: https://www.techpowerup.com/gpu-specs/radeon-instinct-mi25.c2983

