Bug#1074466: RFP: nvidia-mofed -- NVidia MLNX OFED software for Infiniband
Package: wnpp
Severity: wishlist
* Package name : nvidia-mofed
Version : 24.04-0.6.6.
Upstream Contact: I don't know
* URL : https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_SRC-debian-24.04-0.6.6.0.tgz
https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-debian12.1-x86_64.tgz
* License : I don't know
Programming Lang: I don't know
Description : NVidia MLNX OFED software for Infiniband
Debian 12.5 contains packages for Infiniband based on OFED. However,
these packages don't fully support the most recent ConnectX-7 hardware.
opensm does not support NDR. Infiniband needs a subnet manager, so if you
use an unmanaged switch, you need to run opensm. Also, the
nvidia-peermem kernel module does not load. You need this kernel
module to do RDMA direct to GPUs. Without it, NVidia NCCL bandwidth
is about one tenth of what is possible with it.
The nvidia-peermem kernel module is provided by nvidia-kernel-dkms in
Debian 12.5. But for it to load, it must be compiled with Mellanox
ib_peer_mem symbols. Apparently, this is all handled by MOFED.
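For anyone trying to reproduce the module problem, here is a minimal sketch of checking whether nvidia-peermem actually got loaded. It only parses text in the /proc/modules format; the helper names (loaded_modules, is_module_loaded, read_proc_modules) are made up for illustration and are not part of any NVidia or Debian tool.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_text):
    """Check for a module, accepting either hyphens or underscores."""
    # The kernel lists module names with underscores (nvidia_peermem),
    # so normalize the hyphenated spelling before comparing.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux machine, is_module_loaded("nvidia-peermem", read_proc_modules()) should return False in the situation described above, and True once the module loads.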
NVidia provides MOFED aka MLNX OFED packages for Debian 12.1:
https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_SRC-debian-24.04-0.6.6.0.tgz
https://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-debian12.1-x86_64.tgz
But if you install them, they uninstall many standard Debian 12.5
packages, among them
libboost-all-dev
libopenmpi-dev
libopencv-dev
python3-opencv
python3-torch
ros-desktop-full-dev
and many of their dependencies. People (like myself) who use GPUs and
Infiniband are likely to also need Open MPI, OpenCV, PyTorch, and ROS.
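A rough sketch of how one might preview such removals before committing to an install. It assumes the MOFED .deb files can be handed to apt (for example as local ./*.deb paths), which may not match what the NVidia installer script does internally; the function names are hypothetical. The only thing it relies on is that `apt-get -s` (simulate) prints one "Remv" line per package it would remove.

```python
import subprocess


def packages_to_remove(simulation_output):
    """Collect package names from the 'Remv' lines of `apt-get -s` output."""
    removed = []
    for line in simulation_output.splitlines():
        parts = line.split()
        # Simulated removals look like: "Remv <package> [<version>]"
        if len(parts) >= 2 and parts[0] == "Remv":
            removed.append(parts[1])
    return removed


def simulate_install(packages):
    """Run `apt-get -s install` on the given names or .deb paths.

    The -s flag only simulates; nothing is installed or removed.
    """
    result = subprocess.run(
        ["apt-get", "-s", "install", *packages],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling packages_to_remove(simulate_install([...])) with the packages in question would list what apt intends to remove, so one can see whether libopenmpi-dev, python3-torch, and friends are affected before anything changes.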
It would be great if Debian could officially package MOFED in non-free
in a way that was compatible with the Debian ecosystem, so one could
just use apt to do package management.
In the longer term (fall 2024), NVidia plans to replace MOFED with
DOCA. It would be great if Debian could package DOCA as well.
This should not be difficult since MOFED works seamlessly with Ubuntu
24.04 LTS.
To get a subnet manager, I needed to set up a whole other machine
running Ubuntu 24.04. It would be much more convenient if I could just
continue to be a pure Debian user.
I am not a Debian dev. I am not familiar with Debian dev/maintenance
practices.