Re: ROCm on APUs

To: debian-ai@lists.debian.org
Subject: Re: ROCm on APUs
From: Cordell Bloor <cgmb@slerp.xyz>
Date: Tue, 28 May 2024 23:42:33 -0600
Message-id: <ce83bd43-1d87-4c14-b88f-187f36963a0c@slerp.xyz>
In-reply-to: <12b6e39d-e088-4af6-952d-ea2fed91293e@slerp.xyz>
References: <12b6e39d-e088-4af6-952d-ea2fed91293e@slerp.xyz>


On 2024-04-02 16:17, Cordell Bloor wrote:

The ROCm packages for Debian are built such that they can run on AMD APUs, however, there is a major limitation. Integrated GPUs are often configured with a relatively small amount of initial memory dedicated to the GPU. The APU expects that the memory reserved for the GPU will be adjusted dynamically. Unfortunately, HIP applications will not automatically request more memory to be assigned to the GPU and are therefore stuck with the default allocation.

Carlos Segura has an interesting workaround [1]. Using LD_PRELOAD, they replace hipMalloc / hipFree with hipHostMalloc / hipHostFree to force all device memory allocations to use pinned host memory instead.
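
For readers unfamiliar with the technique, here is a rough sketch of what such an LD_PRELOAD interposer could look like. It is not the code from [1]. The file name, the lookup() helper, and the SHIM_SYMBOL_MISSING value are made up for illustration, and hipError_t is declared as a plain int so that the sketch builds without the HIP headers. It also assumes the application links the HIP runtime at startup, so that hipHostMalloc and hipHostFree are visible in the global symbol scope.

```c
/* hip_host_alloc_shim.c: illustrative LD_PRELOAD shim, not the code from [1]. */
#include <dlfcn.h>
#include <stddef.h>

/* hipError_t is an enum in the HIP headers; a plain int is used here so the
 * sketch compiles without them. A return value of 0 corresponds to hipSuccess. */
typedef int hipError_t;

/* Hypothetical error value returned when the HIP runtime is not loaded.
 * Assumed to match hipErrorUnknown. */
#define SHIM_SYMBOL_MISSING 999

typedef hipError_t (*host_malloc_fn)(void **, size_t, unsigned int);
typedef hipError_t (*host_free_fn)(void *);

/* Search the main program and the libraries loaded at startup for a symbol.
 * The shim does not define hipHostMalloc or hipHostFree itself, so a match
 * can only come from the real HIP runtime. */
static void *lookup(const char *name)
{
    static void *self;
    if (!self)
        self = dlopen(NULL, RTLD_LAZY);
    return self ? dlsym(self, name) : NULL;
}

/* Shadows the runtime's hipMalloc: hand out pinned host memory instead of
 * memory from the GPU's dedicated allocation. */
hipError_t hipMalloc(void **ptr, size_t size)
{
    host_malloc_fn real = (host_malloc_fn)lookup("hipHostMalloc");
    if (!real)
        return SHIM_SYMBOL_MISSING;
    return real(ptr, size, 0u); /* flags 0 is hipHostMallocDefault */
}

/* Shadows the runtime's hipFree: pointers handed out above came from
 * hipHostMalloc, so they must be released with hipHostFree. */
hipError_t hipFree(void *ptr)
{
    host_free_fn real = (host_free_fn)lookup("hipHostFree");
    if (!real)
        return SHIM_SYMBOL_MISSING;
    return real(ptr);
}
```

Something like `cc -shared -fPIC hip_host_alloc_shim.c -o libhipshim.so -ldl` followed by `LD_PRELOAD=./libhipshim.so ./your-hip-app` would put the shim in front of the runtime. Because every hipMalloc call is redirected, anything that relies on the pointer being a true device allocation may misbehave.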

After raising the topic here, I briefly discussed this workaround with a number of folks, including Felix Kuehling. He pointed out that replacing hipMalloc with hipHostMalloc is likely to break the CUDA IPC API. He suggested that perhaps they could instead adapt the approach taken in the driver for MI300A to smaller APUs. That is, replacing hipMalloc with kernel-allocated system memory buffer objects.

I'm not sure whether it was as a result of that conversation, or if the KFD developers were working on this anyway, but it seems that a patch implementing this approach landed for Linux 6.10 RC1 [2][3]. There is at least one user report of successfully running Stable Diffusion without the force-host-allocation hack [4].

It's nice to see that consumer APUs are benefiting from the work done for MI300A.


Sincerely,
Cory Bloor

[1]: https://github.com/segurac/force-host-alloction-APU
[2]: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs
[3]: https://gitlab.freedesktop.org/drm/kernel/-/commit/eb853413d02c8d9b27942429b261a9eef228f005
[4]: https://github.com/ROCm/ROCm/issues/2014#issuecomment-2131988809
