Working towards pytorch-rocm
Hi folks,
I had hoped to be able to say that I had pytorch-rocm built today, but
after a solid evening of fighting with it, I still haven't managed it.
The problem is mainly that PyTorch is a moving target: while we had
everything needed to package PyTorch 2.4.0, the changes made by PyTorch
2.6.0 significantly raise the baseline required.
There are now a number of places where PyTorch assumes ROCm 6.0 or
greater, and while it's not that hard to patch in compatibility with
older versions of ROCm, there are quite a few of them. The dependency on
hipBLASLt has also become deeper, making it much more difficult to remove
[1]. There is now also a hard dependency on composable_kernel [2].
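To get a rough sense of how many version assumptions there are, one option is to search the checkout for preprocessor comparisons against ROCM_VERSION. This is only a sketch: the function names are hypothetical, and it assumes the guards are written with the ROCM_VERSION macro, so it will miss checks expressed some other way (for example in CMake version comparisons).

```shell
# Hypothetical helpers (not part of PyTorch): find source lines that compare
# ROCM_VERSION against a number, as an estimate of how many places would need
# patching to support an older ROCm release.

find_rocm_guards() {
    # $1: directory to search, e.g. the root of the PyTorch checkout.
    # --include is a GNU grep option; it limits the search to source files.
    grep -rnE \
        --include='*.h' --include='*.hpp' --include='*.cpp' \
        --include='*.cu' --include='*.cuh' --include='*.cmake' \
        --include='CMakeLists.txt' \
        'ROCM_VERSION[[:space:]]*(>=|<=|>|<|==)[[:space:]]*[0-9]+' "$1"
}

count_rocm_guards() {
    # Print the number of matching lines under directory $1.
    find_rocm_guards "$1" | wc -l
}
```

Running `count_rocm_guards .` from the top of the checkout prints a single number; `find_rocm_guards aten` lists the individual lines with file names and line numbers.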
It wasn't too hard to revert the composable_kernel additions, but I'm
failing on the hipBLASLt removal. It seems silly that hipBLASLt is a
required dependency when it's an enormous library and it only supports a
tiny subset of AMD GPUs. Still, if we finish packaging hipBLASLt, then
we won't need to worry about patching it out. It may be worth consulting
the pytorch-rocm package in Fedora for patches, as they face some of the
same constraints [3].
I haven't been hacking directly on the PyTorch Debian package as that is
even more complex than just getting the upstream PyTorch repo building.
I want to take things one step at a time. This is how I've been hacking
on pytorch:

    git clone --recursive https://github.com/pytorch/pytorch.git
    cd pytorch
    apt install python3-full python3-dev ninja-build cmake hipcc \
        libhipblas-dev librocblas-dev libhipsolver-dev librocsolver-dev \
        libhipfft-dev librocfft-dev libhipsparse-dev librocsparse-dev \
        librocthrust-dev librocprim-dev libhipcub-dev librccl-dev \
        libmagma-rocm-dev
    python3 -m venv venv3
    source venv3/bin/activate
    pip install -r requirements.txt
    pip install mkl-static mkl-include
    export USE_CUDA=0
    export USE_ROCM=1
    export USE_XPU=0
    export USE_KINETO=0 # requires roctracer otherwise
    export USE_CK_FLASH_ATTENTION=0
    export _GLIBCXX_USE_CXX11_ABI=1
    export PYTORCH_ROCM_ARCH=gfx906 # my gpu
    export ROCM_PATH=/usr
    export HIP_DEVICE_LIB_PATH="/usr/lib/llvm-17/lib/clang/17/amdgcn/bitcode"
    export HIP_CLANG_PATH=/usr/bin # workaround Bug #1099404
    export CXXFLAGS='-Wno-error' # gcc errors otherwise
    python tools/amd_build/build_amd.py
    # <edit files here>
    python setup.py develop
    # ninja -C build # to continue build after making changes

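Since the build is long and several of the exported paths are specific to one machine, a small pre-flight check can catch a wrong path before the build starts. The following is a minimal sketch; check_build_env is a hypothetical helper, not something provided by PyTorch or ROCm, and it only checks that the three path variables set above name existing directories.

```shell
# Hypothetical pre-flight check (not part of PyTorch or ROCm): confirm that
# the path variables set for the build point at existing directories.
check_build_env() {
    status=0
    for var in ROCM_PATH HIP_DEVICE_LIB_PATH HIP_CLANG_PATH; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "unset: $var" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "missing directory: $var=$value" >&2
            status=1
        fi
    done
    # Returns 0 only if every variable is set and names a directory.
    return "$status"
}
```

It could be run as `check_build_env && python setup.py develop`, so that the build is skipped if any path is wrong.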
I got libgloo_hip.a and libc10_hip.so built, but I think the path to
getting aten/ building involves putting more work into getting the ROCm
stack updated. With that said, I might still try to get a patched
version of PyTorch 2.5.0 building anyway, just because I think that
might still be useful even if it cannot be included in Debian.
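To keep track of how far a partial build gets, something like the following could report which libraries exist so far. This is a sketch under assumptions: report_built is a hypothetical helper, and both the build/lib output directory and the libtorch_hip.so name in the usage note are assumptions rather than details taken from this message.

```shell
# Hypothetical helper: report which of the named libraries exist in a build
# output directory.
report_built() {
    dir=$1      # directory to look in (assumed to be build/lib for PyTorch)
    shift       # remaining arguments are library file names
    for lib in "$@"; do
        if [ -e "$dir/$lib" ]; then
            echo "built:   $lib"
        else
            echo "missing: $lib"
        fi
    done
}
```

For example, `report_built build/lib libgloo_hip.a libc10_hip.so libtorch_hip.so` prints one line per library, marked either "built" or "missing".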
On 2025-03-02 15:41, Cordell Bloor wrote:
> Today I'll file a proper bug for that issue and maybe do one final
> upload of rocm-hipamd 5.7 to unstable.
My justification for doing that would have been to make it easier to
proceed with pytorch-rocm. As it's not going to help with that, I'm not
going to bother. Might as well just fix these bugs in the latest version.
Sincerely,
Cory Bloor
[1]: https://github.com/pytorch/pytorch/pull/120551
[2]: https://github.com/pytorch/pytorch/commit/3f3b692a00737c54a3e2948db5db493d40119854
[3]: https://src.fedoraproject.org/rpms/python-torch/tree/rawhide