
Working towards pytorch-rocm



Hi folks,

I had hoped to report that pytorch-rocm was built today, but after a solid evening of fighting with it, I still haven't managed it. The main problem is that PyTorch is a moving target: while we had everything needed to package PyTorch 2.4.0, the changes made by PyTorch 2.6.0 significantly raise the baseline requirements.

There are now a number of places where PyTorch assumes ROCm 6.0 or greater, and while it's not that hard to patch in compatibility with older versions of ROCm, there are quite a few of them. The dependency on hipBLASLt has also become deeper, making it much more difficult to remove [1]. There is now also a hard dependency on composable_kernel [2].
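
One rough way to enumerate those places is to grep the source tree for preprocessor conditionals that test the ROCm version. This is only a sketch: it assumes the guards are written against a macro named ROCM_VERSION, so checks spelled any other way would be missed, and the function name list_rocm_version_guards is made up for illustration.

```shell
# list_rocm_version_guards (hypothetical helper): print file:line:text for
# every #if / #elif line under the given paths that mentions ROCM_VERSION.
# Assumption: version checks use the ROCM_VERSION macro; other spellings
# are not detected.
list_rocm_version_guards() {
    grep -rnE '^[[:space:]]*#[[:space:]]*(if|elif).*ROCM_VERSION' "$@"
}
```

Run from the top of the PyTorch checkout, for example as `list_rocm_version_guards aten c10 torch`, this gives a list of candidate lines to review by hand. It does not tell you which of them actually need patching.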

It wasn't too hard to revert the composable_kernel additions, but I'm failing on the hipBLASLt removal. It seems silly that hipBLASLt is a required dependency when it's an enormous library and it only supports a tiny subset of AMD GPUs. Still, if we finish packaging hipBLASLt, then we won't need to worry about patching it out. It may be worth consulting the pytorch-rocm package in Fedora for patches, as they face some of the same constraints [3].

I haven't been hacking directly on the PyTorch Debian package, as that is even more complex than just getting the upstream PyTorch repo building. I want to take things one step at a time. This is how I've been hacking on PyTorch:

git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
apt install python3-full python3-dev ninja-build cmake hipcc libhipblas-dev librocblas-dev libhipsolver-dev librocsolver-dev libhipfft-dev librocfft-dev libhipsparse-dev librocsparse-dev librocthrust-dev librocprim-dev libhipcub-dev librccl-dev libmagma-rocm-dev
python3 -m venv venv3
source venv3/bin/activate
pip install -r requirements.txt
pip install mkl-static mkl-include
export USE_CUDA=0
export USE_ROCM=1
export USE_XPU=0
export USE_KINETO=0 # requires roctracer otherwise
export USE_CK_FLASH_ATTENTION=0
export _GLIBCXX_USE_CXX11_ABI=1
export PYTORCH_ROCM_ARCH=gfx906 # my gpu
export ROCM_PATH=/usr
export HIP_DEVICE_LIB_PATH="/usr/lib/llvm-17/lib/clang/17/amdgcn/bitcode"
export HIP_CLANG_PATH=/usr/bin # workaround Bug #1099404
export CXXFLAGS='-Wno-error' # gcc errors otherwise
python tools/amd_build/build_amd.py
# <edit files here>
python setup.py develop
# ninja -C build # to continue build after making changes
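
Before starting the long build, it can save time to confirm that the directories exported above actually exist on the machine, since the LLVM version in HIP_DEVICE_LIB_PATH in particular will differ between systems. A minimal sketch, assuming a POSIX shell; check_build_paths is a hypothetical helper, not part of PyTorch or ROCm:

```shell
# check_build_paths (hypothetical helper): report each argument that does
# not exist on disk and return non-zero if any of them is missing.
check_build_paths() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With the exports above in effect, it could be called as `check_build_paths "$ROCM_PATH" "$HIP_DEVICE_LIB_PATH" "$HIP_CLANG_PATH"` just before running build_amd.py.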

I got libgloo_hip.a and libc10_hip.so built, but I think the path to getting aten/ building involves putting more work into getting the ROCm stack updated. With that said, I might still try to get a patched version of PyTorch 2.5.0 building anyway, just because I think that might still be useful even if it cannot be included in Debian.

On 2025-03-02 15:41, Cordell Bloor wrote:
> Today I'll file a proper bug for that issue and maybe do one final upload of rocm-hipamd 5.7 to unstable.

My justification for doing that would have been to make it easier to proceed with pytorch-rocm. As it's not going to help with that, I'm not going to bother. Might as well just fix these bugs in the latest version.

Sincerely,
Cory Bloor

[1]: https://github.com/pytorch/pytorch/pull/120551
[2]: https://github.com/pytorch/pytorch/commit/3f3b692a00737c54a3e2948db5db493d40119854
[3]: https://src.fedoraproject.org/rpms/python-torch/tree/rawhide

