
Bug#991453: linux-image-5.10.0-8-amd64: Radeon 6800 XT: 100% GPU core usage & 74 Watts when idle



On Saturday 24 July 2021 22:03:23 CEST Diederik de Haas wrote:
> > It's already backported to 5.10, just after 5.10.46 was released:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7
> 
> Just found out the reverted commit was introduced just before the 5.10.46
> tag was created, which should mean that any version before 5.10.46 should
> NOT have this problem. 

I just found out that two commits need to be reverted. These are the
two reverts in linux-5.10.y:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=1bd81429d53ded4e111616c755a64fad80849354
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=fea853aca3210c21dfcf07bb82d501b7fd1900a7

The first one I saved (and attached) as 'fix-bug991453-part1.patch' and the 
second one as 'fix-bug991453-part2.patch'.

Then I followed steps 4.2.1 and 4.2.2 of the kernel handbook:
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official
That resulted in 'linux-image-5.10.0-8-amd64-unsigned_5.10.46-2a~test_amd64.deb',
which I then installed on my system and rebooted into.

$ uname -a
Linux bagend 5.10.0-8-amd64 #1 SMP Debian 5.10.46-2a~test (2021-07-24) x86_64 GNU/Linux
$ cat /sys/class/drm/card0/device/gpu_busy_percent
0
$ sensors
nvme-pci-0100
Adapter: PCI adapter
Composite:    +40.9°C  (low  = -273.1°C, high = +72.8°C)
                       (crit = +75.8°C)
Sensor 1:     +40.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +51.9°C  (low  = -273.1°C, high = +65261.8°C)

amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:      750.00 mV 
fan1:        1208 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +42.0°C  (crit = +85.0°C, hyst = -273.1°C)
                       (emerg = +90.0°C)
junction:     +42.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
mem:          +43.0°C  (crit = +95.0°C, hyst = -273.1°C)
                       (emerg = +100.0°C)
power1:        7.00 W  (cap = 260.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +76.5°C  
Tdie:         +56.5°C

# radeontop
Graphics pipe 0.83%
0.17G / 0.94G Memory Clock 17.67%
0.03G / 1.63G Shader Clock  1.78%

These are the same 'scores' as I had with the 5.10.0-7-amd64 kernel.
So applying the attached patches on top of the current kernel as
available in Debian Testing/Bullseye and Sid fixes the problem.
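
For anyone who wants to repeat the idle check without eyeballing the
output, here is a minimal Python sketch. It reads the same sysfs file
as the 'cat' command above. The 'card0' path is from my system and may
differ on yours; 'looks_idle' and its 5% threshold are hypothetical
helpers made up for illustration, not part of any driver or tool.

```python
from pathlib import Path

# Path taken from the 'cat' command above; card0 may be a different
# card on other systems (assumption).
BUSY_FILE = Path("/sys/class/drm/card0/device/gpu_busy_percent")


def read_busy_percent(path: Path = BUSY_FILE) -> int:
    """Return the GPU load in percent as found in the sysfs file."""
    return int(path.read_text().strip())


def looks_idle(percent: int, threshold: int = 5) -> bool:
    """Hypothetical helper: treat a load at or below `threshold` as idle."""
    return percent <= threshold


if __name__ == "__main__":
    if BUSY_FILE.exists():
        busy = read_busy_percent()
        state = "idle" if looks_idle(busy) else "busy"
        print(f"GPU busy: {busy}% ({state})")
    else:
        print(f"{BUSY_FILE} not found; is this an amdgpu system?")
```

On an otherwise idle desktop, a reading that stays near 100 would match
the behaviour reported in this bug, while a reading near 0 matches the
patched kernel.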

In the last year I've spent considerable time bringing down my energy
usage/needs, and I never expected that (reverting) 2 kernel commits
would save me 67W (continuously), so thank you very much piorunz for
bringing this to my attention.
I normally have a quiet system and noticed it often wasn't quiet lately.
I blame(d) 'baloo' (file indexing) for that, but it turns out it was mostly
my GPU running at 100% all the time.
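
To put the 67W in perspective, here is a rough back-of-the-envelope
calculation in Python. It takes the 74W idle figure from the bug title
and the 7.00W 'power1' reading above, and assumes the machine is on and
idle around the clock, which is an upper bound rather than a
measurement. The function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def saved_kwh_per_year(watts_before: float, watts_after: float,
                       hours: float = HOURS_PER_YEAR) -> float:
    """Energy saved in kWh for a constant power difference over `hours`."""
    return (watts_before - watts_after) * hours / 1000


# 74 W: idle draw from the bug title; 7 W: power1 reading after the fix.
# Assumes the system idles 24/7 (upper bound, not a measurement).
print(saved_kwh_per_year(74, 7))  # prints 586.92
```

So under that assumption a single affected system wastes a bit under
600 kWh per year.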

It looks like a lot of users with AMD GPUs are affected, and a
considerable amount of energy is wasted because of it (Climate Change),
so I really hope/urge that these 2 patches/reverts are applied before
Bullseye gets released.

Cheers,
  Diederik
From 1bd81429d53ded4e111616c755a64fad80849354 Mon Sep 17 00:00:00 2001
From: Yifan Zhang <yifan1.zhang@amd.com>
Date: Sat, 19 Jun 2021 11:40:54 +0800
Subject: Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG
 issue."

commit ee5468b9f1d3bf48082eed351dace14598e8ca39 upstream.

This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20.

Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 1859d293ef712..fb15e8b5af32f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3619,12 +3619,8 @@ static int gfx_v9_0_kiq_init_register(struct amdgpu_ring *ring)
 	if (ring->use_doorbell) {
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_LOWER,
 					(adev->doorbell_index.kiq * 2) << 2);
-		/* If GC has entered CGPG, ringing doorbell > first page doesn't
-		 * wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround
-		 * this issue.
-		 */
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_UPPER,
-					(adev->doorbell.size - 4));
+					(adev->doorbell_index.userqueue_end * 2) << 2);
 	}
 
 	WREG32_SOC15_RLC(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
-- 
cgit 1.2.3-1.el7

From fea853aca3210c21dfcf07bb82d501b7fd1900a7 Mon Sep 17 00:00:00 2001
From: Yifan Zhang <yifan1.zhang@amd.com>
Date: Sat, 19 Jun 2021 11:39:43 +0800
Subject: Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to
 cover full doorbell."

commit baacf52a473b24e10322b67757ddb92ab8d86717 upstream.

This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3.

Reason for revert: Side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 3c92dacbc24ad..fc8da5fed779b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6590,12 +6590,8 @@ static int gfx_v10_0_kiq_init_register(struct amdgpu_ring *ring)
 	if (ring->use_doorbell) {
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_LOWER,
 			(adev->doorbell_index.kiq * 2) << 2);
-		/* If GC has entered CGPG, ringing doorbell > first page doesn't
-		 * wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround
-		 * this issue.
-		 */
 		WREG32_SOC15(GC, 0, mmCP_MEC_DOORBELL_RANGE_UPPER,
-			(adev->doorbell.size - 4));
+			(adev->doorbell_index.userqueue_end * 2) << 2);
 	}
 
 	WREG32_SOC15(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL,
-- 
cgit 1.2.3-1.el7
