Hi Tom,
I look forward to working with you.
> Are there some things that need some attention for ROCm or PyTorch?
There are many things, but I would suggest starting with the
driver.
Someone needs to look at what is required to get the tests
provided by librccl1-tests passing on the Debian AI Team's MI210
server ('pinwheel'). I believe the issue is merely that the HSA
P2P Kconfig option is not enabled by default in Debian.
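As a quick way to confirm that on a given machine, something like the
rough sketch below could be used. It assumes the option in question is
CONFIG_HSA_AMD_P2P (my reading of "HSA P2P"), that CONFIG_PCI_P2PDMA
and CONFIG_DMABUF_MOVE_NOTIFY are its relevant dependencies, and that
the running kernel's config is at /boot/config-<release>. The
function names are just for illustration.

```python
import os
import re

# Assumed option names: CONFIG_HSA_AMD_P2P is a guess at the "HSA P2P"
# option; the other two are assumed to be its dependencies.
OPTIONS = (
    "CONFIG_HSA_AMD_P2P",
    "CONFIG_PCI_P2PDMA",
    "CONFIG_DMABUF_MOVE_NOTIFY",
)


def parse_kconfig(text):
    """Map option name -> value from the text of a kernel config file.

    Options written as '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if m:
            values[m.group(1)] = "n"
            continue
        m = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if m:
            values[m.group(1)] = m.group(2)
    return values


def report(text, options=OPTIONS):
    """Return the value of each option, or 'absent' if it is not mentioned."""
    values = parse_kconfig(text)
    return {opt: values.get(opt, "absent") for opt in options}


if __name__ == "__main__":
    # Assumed location of the running kernel's config file.
    path = f"/boot/config-{os.uname().release}"
    if os.path.exists(path):
        with open(path) as f:
            for opt, val in report(f.read()).items():
                print(f"{opt}: {val}")
    else:
        print(f"no kernel config found at {path}")
```

If the first option comes back as "n" or "absent" on pinwheel, that
would be consistent with my guess above, though it would not by itself
prove that this is the only thing blocking the tests.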
We will need to engage with the Debian Linux Kernel team to find
a way to enable AMD GPU P2P functionality. This would presumably
be achieved either by providing an alternative kernel package, an
out-of-tree driver package, or changing the default kernel
configuration. If they are open to it, the last of those options
would probably be the best.
There's also the question of RDMA and other features not
available in the upstream kernel, but I expect that will probably
resolve itself eventually. We may need to engage with the AMD
kernel developers to understand the roadmap for upstreaming
functionality.
Sincerely,
Cory Bloor