Navi 12 on Debian (was: August ROCm Package Testing Results)
I asked for some help and I now understand the Navi 12 situation.
On 2023-08-26 00:22, Cordell Bloor wrote:
> A notable missing entry in my test strategy is gfx1011, which was not
> tested because installing Navi 12 hardware seems to cause the amdgpu
> driver to crash on startup. I know that Navi 12 hardware can work with
> ROCm, because I've used it on AWS g4ad instances running Ubuntu. I'll
> have to dig into this more.
The problem appears to be resuming from the BACO (Bus Active, Chip Off)
power saving mode. I had no issues after adding amdgpu.runpm=0 to my
kernel parameters. I've seen this issue with two different Navi 12 GPUs,
so I assume it is a driver problem and not bad hardware. I'll file a bug
report when I get a chance.
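
For anyone applying the same workaround, here is a rough sketch of how one
might confirm that the parameter actually reached the running kernel. It
assumes a Linux system where /proc/cmdline is readable; the function names
are made up for illustration and are not part of any ROCm or Debian tool.
On a GRUB-based Debian install, the parameter would typically be added to
GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub
and a reboot.

```python
from pathlib import Path
from typing import Optional


def runpm_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the amdgpu.runpm value found in a kernel command line.

    Returns None if the parameter is absent. If it appears more than
    once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.runpm":
            value = val
    return value


def current_runpm() -> Optional[str]:
    """Read the running kernel's command line (assumes /proc is mounted)."""
    return runpm_from_cmdline(Path("/proc/cmdline").read_text())
```

Calling current_runpm() after a reboot should return "0" if the workaround
is in effect, and None if the parameter was never passed to the kernel.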
I updated the supported GPU list [1] and my logs [2] with the gfx1011
results. I saw no problems, aside from a rocSPARSE test case that
slightly exceeded the specified tolerance. That test failure is seen on
gfx1030, too. We can probably just increase the tolerance slightly.
In conclusion, while there is a serious driver bug affecting Navi 12
hardware, there is a straightforward workaround for the problem. The
ROCm math libraries all work as expected on gfx1011.
Sincerely,
Cory Bloor
[1]: https://salsa.debian.org/rocm-team/community/team-project/-/wikis/Supported-GPU-list
[2]: https://slerp.xyz/rocm/logs/full/