
Re: Introductions for Tom Rix



Hi Tom,

I look forward to working with you.

On 2024-08-08 18:16, Rix, Tom wrote:

> Are there some things that need some attention for ROCm or PyTorch?

There are many things, but I would suggest starting with the driver.

Someone needs to work out what is required to get the tests provided by librccl1-tests passing on the Debian AI Team's MI210 server ('pinwheel'). I believe the issue is merely that the HSA P2P Kconfig option is not enabled by default in Debian.
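
As a rough illustration of how one might confirm that on a given machine, here is a minimal Python sketch that reads the running kernel's configuration and reports which P2P-related options are off. The option names (CONFIG_HSA_AMD_P2P and the options I believe it depends on) and the /boot/config-<release> path are assumptions on my part, not something I have verified on pinwheel, so please check them against the amdkfd Kconfig before relying on the result.

```python
import platform
import re
from pathlib import Path

# Assumed option names: verify against drivers/gpu/drm/amd/amdkfd/Kconfig.
REQUIRED = (
    "CONFIG_HSA_AMD",
    "CONFIG_HSA_AMD_P2P",
    "CONFIG_PCI_P2PDMA",
    "CONFIG_DMABUF_MOVE_NOTIFY",
)

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option names to their values; explicitly unset options map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def missing_options(options, required=REQUIRED):
    """Return the required options that are neither built in (y) nor modular (m)."""
    return [name for name in required if options.get(name, "n") not in ("y", "m")]


def main():
    # Assumption: the kernel config is installed as /boot/config-<release>.
    path = Path("/boot") / f"config-{platform.release()}"
    if not path.exists():
        print(f"no kernel config found at {path}")
        return
    missing = missing_options(parse_kconfig(path.read_text()))
    if missing:
        print("not enabled:", ", ".join(missing))
    else:
        print("all assumed P2P-related options are enabled")


if __name__ == "__main__":
    main()
```

An option that is absent from the file is treated the same as one that is explicitly unset, which seemed the safer default for this kind of check.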

We will need to engage with the Debian Linux Kernel team to find a way to enable AMD GPU P2P functionality. This would presumably be achieved by providing an alternative kernel package, providing an out-of-tree driver package, or changing the default kernel configuration. If they are open to it, the last of those options would likely be the best.

There's also the question of RDMA and other features not available in the upstream kernel, but I expect that will resolve itself eventually. We may need to engage with the AMD kernel developers to understand the roadmap for upstreaming that functionality.

Sincerely,
Cory Bloor

