
Black screens on old nVidia card



My tower (running Bullseye) has been suffering the black screen
of... well, not quite death (I can ssh in from another machine and
look at things), but it's certainly unusable for normal purposes.
I have an old nVidia video card; I've always had a bit of trouble
with it, but things got better when I replaced nouveau with the
appropriate proprietary nVidia driver (currently version 390.157).
But lately things have been getting worse; it can be only minutes
before my screens go black, and the only way to get things back is
to ssh in from another machine and force a reboot, or reach for
the Big Red Switch.

$ uname -a
Linux killer-penguin 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux

$ lsb_release -a
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

$ cat /etc/debian_version
11.5

$ lspci -v
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 630] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: nvidia
	Kernel modules: nvidia

$ top
top - 09:16:25 up  1:08,  2 users,  load average: 1.00, 1.00, 1.00
Tasks: 192 total,   2 running, 190 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.0 us, 0.0 sy, 0.0 ni, 71.3 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem :   7900.4 total,   6840.5 free,    513.5 used,    546.7 buff/cache
MiB Swap:  16384.0 total,  16384.0 free,      0.0 used.   7126.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    865 root      20   0  278732 105992  55708 R 100.0   1.3  48:37:59 Xorg

A tail of dmesg yields the following messages:

[ 1406.213319] NVRM: GPU at PCI:0000:01:00: GPU-d7903bd4-9549-9f07-5796-886c12d2031c
[ 1406.213322] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 1406.213324] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 1406.213329] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[ 1416.567009] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
[ 1416.567013] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:1:0:0x0000000f
[ 1416.590288] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
[ 1416.590292] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:1:0:0x0000000f
[ 1416.590682] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
[ 1416.590686] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:1:0:0x0000000f
[ 1416.591011] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:0:0:0x0000000f
[ 1416.591015] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000857c:1:0:0x0000000f

"GPU has fallen off the bus" looks suspicious.
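
For anyone wanting to pull just the Xid events out of a long kernel
log, something like the following might do.  It is only a rough
sketch: the function name xid_events is made up for illustration, and
the pattern assumes the "NVRM: Xid (PCI:...): <code>, <message>"
layout seen in the dmesg output above.

```shell
# xid_events (hypothetical helper): read kernel log text on stdin and
# print "<timestamp> <xid code> <message>" for each NVRM Xid line.
# Lines that do not match the assumed layout are silently skipped.
xid_events() {
    sed -n 's/^\[ *\([0-9.]*\)] NVRM: Xid ([^)]*): \([0-9]*\), \(.*\)$/\1 \2 \3/p'
}

# Typical use (dmesg may need root if kernel.dmesg_restrict is set):
#   dmesg | xid_events
```

Fed the log above, this should reduce it to the single Xid 79 line.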

Yes, that video card is pretty long in the tooth; I'd gladly replace
it if a new board will solve the problem.  (If not, why bother?
It works well enough for my purposes.)

I tried running nvidia-bug-report.sh as recommended in the dmesg dump.
It generated a _lot_ of data.  Is there a guide to interpreting it?
I did notice the following lines:

  (==) Matched nvidia as autoconfigured driver 0
  (==) Matched nouveau as autoconfigured driver 1

Does this mean that nouveau is still there and possibly causing a
conflict?
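
One way to check whether the nouveau kernel module is actually loaded
(as opposed to merely being listed as a candidate in the X log) would
be something like the sketch below.  The helper name module_loaded is
made up for illustration; it assumes the usual /proc/modules layout,
with the module name in the first column.

```shell
# module_loaded NAME [FILE] (hypothetical helper): succeed if a module
# called exactly NAME is listed in FILE (default: /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Typical use:
#   if module_loaded nouveau; then echo "nouveau is loaded"; fi
#   if module_loaded nvidia;  then echo "nvidia is loaded";  fi
```

The comparison is on the whole first field, so nvidia_modeset or
nvidia_drm will not be mistaken for nvidia.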

Can anyone suggest where to look next?  Thanks...

--
/~\  Charlie Gibbs                  |  Life is perverse.
\ /  <cgibbs@kltpzyxm.invalid>      |  It can be beautiful -
 X   I'm really at ac.dekanfrus     |  but it won't.
/ \  if you read it the right way.  |    -- Lily Tomlin

