GPU has fallen off the bus

NVRM: GPU at PCI:0000:65:00: GPU-cd57429b-a4d9-917d-72d6-1d9b6c4f6a3a
NVRM: GPU Board Serial Number:
NVRM: Xid (PCI:0000:65:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:65:00.0 has fallen off the bus.
NVRM: GPU is on Board .
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
sched: RT throttling activated

Then Linux crashed.
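
For anyone trying to spot these events in their own logs, here is a minimal sketch that pulls the PCI address and Xid code out of NVRM lines like the ones above. The helper name parse_xid is made up for illustration and is not part of any NVIDIA tool; it only assumes the "NVRM: Xid (PCI:...): NN," layout shown in this thread.

```shell
# parse_xid (hypothetical helper): reads kernel log text on stdin and
# prints "<pci-address> <xid-code>" for every NVRM Xid line it finds.
parse_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | parse_xid
#   journalctl -k | parse_xid
```

Lines that are not Xid reports are dropped, so the output is easy to count or sort per GPU.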

$ lspci -vv | grep -w -A2 NVIDIA
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
--
17:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
--
18:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
--
18:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
--
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
--
65:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
--
b4:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. GP102 [GeForce GTX 1080 Ti]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
--
b4:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
	Subsystem: ZOTAC International (MCO) Ltd. GP102 HDMI Audio Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
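
To tie the address in the Xid message (0000:65:00) to one of the devices listed above, something like the following sketch may help. The function name lspci_match_xid is an assumption for illustration, and it assumes lspci prints addresses without the PCI domain, as in the output above.

```shell
# lspci_match_xid XID_ADDR (hypothetical helper): XID_ADDR is the address
# from the NVRM message, e.g. 0000:65:00. Reads `lspci` output on stdin
# and prints every function (VGA, audio) of that device.
lspci_match_xid() {
    bus_dev=${1#*:}              # drop the domain: "0000:65:00" -> "65:00"
    grep "^${bus_dev}\.[0-7] "
}

# Typical use:
#   lspci | lspci_match_xid 0000:65:00
```

In this thread that would select the 65:00.0 and 65:00.1 entries, i.e. the third of the four GTX 1080 Ti cards in the listing.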

nvidia-bug-report.log.gz (276 KB)

Xid 79 points to overheating or an insufficient/unstable power supply.
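
One way to look for a thermal or power cause is to log temperature and power draw while the system is still healthy, since nvidia-smi can no longer query a GPU once it has dropped off the bus. This is only a sketch: flag_hot_gpus is a made-up helper, the 85 °C threshold is an arbitrary assumption, and which query fields are supported depends on the GPU and driver.

```shell
# flag_hot_gpus LIMIT_C (hypothetical helper): reads CSV rows of the form
# "pci.bus_id, temperature.gpu, power.draw" on stdin and prints the rows
# whose temperature is at or above LIMIT_C.
flag_hot_gpus() {
    awk -F', ' -v limit="$1" '$2 + 0 >= limit + 0 { print }'
}

# Typical use, sampling every 5 seconds (threshold of 85 is an assumption):
#   nvidia-smi --query-gpu=pci.bus_id,temperature.gpu,power.draw \
#              --format=csv,noheader,nounits -l 5 | flag_hot_gpus 85
```

If the card that later reports Xid 79 is consistently the hottest or the one drawing the most power just before the failure, that supports the overheating/power explanation.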

I had another one.

NVRM: GPU at PCI:0000:65:00: GPU-cd57429b-a4d9-917d-72d6-1d9b6c4f6a3a
NVRM: GPU Board Serial Number: 
NVRM: Xid (PCI:0000:65:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:65:00.0 has fallen off the bus.
NVRM: GPU is on Board .
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

Full log: https://gist.github.com/kenorb/9b41910fbced376314b7dda50ccad2cd

I will check the settings in BIOS next time.

The following post suggests this is an issue with ASUS motherboards:

It’s suggested to add the following kernel boot option:

pcie_aspm=off

I’ll try that as well.
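
After changing boot options it is worth confirming that the running kernel actually received the parameter. A minimal sketch, with has_kernel_param being a hypothetical helper name:

```shell
# has_kernel_param PARAM [CMDLINE] (hypothetical helper): succeeds if PARAM
# appears as a whole word on the kernel command line. CMDLINE defaults to
# the contents of /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# Typical use:
#   has_kernel_param pcie_aspm=off && echo "ASPM disabled at boot"
```

On GRUB-based distributions the parameter is usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, after which the GRUB configuration has to be regenerated (update-grub on Debian/Ubuntu) and the machine rebooted. Other boot loaders are configured differently.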


Screenshots from NVIDIA X Server Settings app of the failing GPU (1st of 4):



Did you fix this issue? I am having this on 4.29 and the latest 5.x kernels with all nvidia-drivers versions available on a Gentoo system. Strangely, when I run gpu_burn, which is a CUDA stress test, everything is fine. The problem only occurs when I start X-based software (Xorg or Plasma).

I have this exact same issue every day in Ubuntu 22.04 since I got a GeForce RTX 3060, with every kernel and every Nvidia video driver in the 5xx range. I have tried many different kernel/videodriver combinations and they all have the same problem.

The freeze usually occurs within 2 hours of booting the computer. I have tried multiple boot options that I read around the net had solved the issue for others, such as the famous pcie_aspm=off, but they don’t make a difference in my case. That may be because I don’t have an ASUS mainboard like most people who report this issue. I have a Gigabyte X570 I Aorus Pro instead.

While it sucks to a pretty infuriating level that this problem persists for such a long time over so many updates, I have discovered something interesting. The issue never reappears after a soft reboot.

So I do a Sync, Unmount and reBoot SysRequest (i.e. hold Alt + SysRq while pressing S, U, and B in slow succession) and the problem is gone until I do a cold boot.

I know this topic is old, but it’s still the first result DuckDuckGo gives me. I thought I’d share that new information here.

10:06:20 kernel: [ 3901.114072] NVRM: GPU at PCI:0000:09:00: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
10:06:20 kernel: [ 3901.114079] NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
10:06:20 kernel: [ 3901.114081] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
10:06:20 kernel: [ 3901.114739] NVRM: Xid (PCI:0000:09:00): 32, pid=3502, name=cinnamon, Channel ID 00000010 intr 00800000
10:09:52 kernel: [ 4113.229473] sysrq: Emergency Sync
10:09:52 kernel: [ 4113.229691] Emergency Sync complete
10:09:54 kernel: [ 4114.232443] sysrq: Emergency Remount R/O

Correction: my claim above that the issue never reappears after a soft reboot is false. I was just lucky for some time. See this thread for more on this.

The safest way to reboot a frozen machine is still a Sync, Unmount and reBoot SysRequest (i.e. hold Alt + SysRq while pressing S, U, and B) as long as the kernel is still running.
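
The keyboard combination only works if the kernel.sysrq setting allows those functions. Here is a small sketch for checking that in advance; sysrq_sub_enabled is a made-up helper name, and the bit values (16 for sync, 32 for remount read-only, 128 for reboot, 1 for everything) are taken from the kernel's SysRq documentation.

```shell
# sysrq_sub_enabled MASK (hypothetical helper): succeeds if the given
# kernel.sysrq value permits Sync (16), remount read-only (32) and
# reBoot (128) from the keyboard.
sysrq_sub_enabled() {
    mask=$1
    [ "$mask" -eq 1 ] && return 0      # 1 enables every SysRq function
    need=$((16 | 32 | 128))
    [ $((mask & need)) -eq "$need" ]
}

# Typical use:
#   sysrq_sub_enabled "$(cat /proc/sys/kernel/sysrq)" || echo "S/U/B not fully enabled"
#
# If you can still reach a root shell (e.g. over SSH), the same sequence can
# be triggered without the keyboard:
#   echo s > /proc/sysrq-trigger   # sync
#   echo u > /proc/sysrq-trigger   # remount read-only
#   echo b > /proc/sysrq-trigger   # reboot immediately
```

Writing to /proc/sysrq-trigger as root is not restricted by the kernel.sysrq mask, so it can work even where the key combination is disabled.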