Re: gfx1100: pass-through solved? \o/
On 2024-09-17 20:53, Brian DeRocher wrote:
> Well it took a while, but finally got through 15 rounds of git bisect.
> The first bad commit is de59b69932e6.
That's fantastic news, thanks for the effort Brian!
> And a link[4] to the first bad commit.
>
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de59b69932e64d77445d973a101d81d6e7e670c6
Great, it's a relatively small diff. And even though I know very little
about the amdgpu module, one hunk relevant to gfx1100 sticks out:
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index d0e3583a3cac8b..e9cbe81221548d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -635,7 +635,8 @@ static void gmc_v11_0_vram_gtt_location(struct amdgpu_device *adev,
 	amdgpu_gmc_vram_location(adev, &adev->gmc, base);
 	amdgpu_gmc_gart_location(adev, mc);
-	amdgpu_gmc_agp_location(adev, mc);
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_gmc_agp_location(adev, mc);
amdgpu_sriov_vf() is some virtualization capability check [5]. It
seems that when the driver realizes that it is running in a
virtualized environment, it disables the AGP aperture.
This is confirmed by diffing the dmesg lines containing 'amdgpu'
between 6.6 (good) and 6.7 (bad):
$ diff -Nru good bad
--- good 2024-09-17 21:39:17.436765109 +0200
+++ bad 2024-09-17 21:26:50.328875874 +0200
@@ -12,6 +12,8 @@
amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_me.bin
amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_rlc.bin
amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_mec.bin
+ amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_imu.bin
+ amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/sdma_6_0_0.bin
amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/vcn_4_0_0.bin
amdgpu 0000:01:00.0: ] JPEG decode is enabled in VM mode
amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_mes_2.bin
@@ -27,12 +29,9 @@
amdgpu 0000:01:00.0: BAR 0: assigned
amdgpu 0000:01:00.0: BAR 2: assigned
amdgpu 0000:01:00.0: amdgpu: VRAM: 30704M 0x0000008000000000 - 0x000000877EFFFFFF (30704M used)
- amdgpu 0000:01:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
- amdgpu 0000:01:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
+ amdgpu 0000:01:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
amdgpu: 30704M of VRAM memory ready
amdgpu: 23561M of GTT memory ready.
- amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/gc_11_0_0_imu.bin
- amdgpu 0000:01:00.0: firmware: direct-loading firmware amdgpu/sdma_6_0_0.bin
amdgpu 0000:01:00.0: amdgpu: Will use PSP to load VCN firmware
amdgpu 0000:01:00.0: amdgpu: GECC is enabled
amdgpu 0000:01:00.0: amdgpu: RAP: optional rap ta ucode is not available
@@ -64,5 +63,5 @@
amdgpu 0000:01:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
amdgpu 0000:01:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
amdgpu 0000:01:00.0: amdgpu: Using BACO for runtime pm
- Initialized amdgpu 3.54.0 20150101 for 0000:01:00.0 on minor 0
+ Initialized amdgpu 3.57.0 20150101 for 0000:01:00.0 on minor 0
amdgpu 0000:01:00.0: Cannot find any crtc or sizes
Indeed, the AGP line is missing in the bad case. (Also, the GART memory
region no longer begins at 0; no idea what effect that has, if any.)
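
For anyone who wants to compare their own logs the same way, here is a
rough sketch that pulls the aperture lines out of two dmesg excerpts and
reports which apertures disappeared and where each one starts. The regex
and the helper names (parse_apertures, size_mib) are my own and only
assume the line format seen in the diff above, so treat it as an
illustration rather than anything authoritative.

```python
import re

# Matches aperture lines such as
#   "amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF"
# (format assumed from the dmesg excerpt quoted in this mail)
APERTURE_RE = re.compile(
    r"amdgpu: (VRAM|GART|AGP): (\d+)M (0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+)"
)


def parse_apertures(lines):
    """Return {name: (start, end, reported_mib)} for each aperture line."""
    found = {}
    for line in lines:
        m = APERTURE_RE.search(line)
        if m:
            name, mib, start, end = m.groups()
            found[name] = (int(start, 16), int(end, 16), int(mib))
    return found


def size_mib(start, end):
    """Size in MiB of an inclusive address range."""
    return (end - start + 1) >> 20


# Aperture lines copied from the good (6.6) and bad (6.7) logs.
good = parse_apertures([
    "amdgpu 0000:01:00.0: amdgpu: VRAM: 30704M 0x0000008000000000 - 0x000000877EFFFFFF (30704M used)",
    "amdgpu 0000:01:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF",
    "amdgpu 0000:01:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF",
])
bad = parse_apertures([
    "amdgpu 0000:01:00.0: amdgpu: VRAM: 30704M 0x0000008000000000 - 0x000000877EFFFFFF (30704M used)",
    "amdgpu 0000:01:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF",
])

missing = sorted(set(good) - set(bad))
print("apertures missing in bad log:", missing)

for label, apertures in (("good", good), ("bad", bad)):
    for name, (start, end, reported) in sorted(apertures.items()):
        print(f"{label}: {name} starts at {start:#x}, "
              f"computed {size_mib(start, end)}M, reported {reported}M")
```

The computed sizes agree with what the driver prints, so the ranges in
the log are inclusive, and the only structural differences are the
missing AGP aperture and the relocated GART.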
Now the bad commit indicates only that a default has changed, so the old
behavior should still be reachable. Looking at the module parameters [6],
it seems that AGP can be enabled explicitly.
So I added amdgpu.agp=1 to the kernel command line and... success! \o/
rocminfo works again, and a few tests that I have run also work fine.
Tested this with 6.7 and 6.10, firmware 202407.
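
If you want to double-check that the option actually reached the kernel,
something like the following sketch may help. It looks for amdgpu.agp in
/proc/cmdline and, if present, prints the value the module reports under
/sys/module/amdgpu/parameters/agp. The sysfs path is an assumption on my
part (it only exists when the module is loaded and the parameter is
exported as readable), and the function names are made up for this
example.

```python
from pathlib import Path


def agp_setting_from_cmdline(cmdline):
    """Return the value given for amdgpu.agp on a kernel command line,
    or None if the option is absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.agp":
            value = val  # keep the last occurrence if given more than once
    return value


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("amdgpu.agp on command line:", agp_setting_from_cmdline(cmdline))
    # Assumed location of the runtime value; prints None if unavailable.
    print("value reported by module:",
          read_text_or_none("/sys/module/amdgpu/parameters/agp"))
```

Together with a dmesg check for the "AGP:" aperture line, this should
tell you whether the workaround is in effect.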
Could you give this a try on your end, too?
Best,
Christian
[5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h#n286
[6]: https://www.kernel.org/doc/html/v6.11/gpu/amdgpu/module-parameters.html