Xorg freezes with Xid error message, 1-3 minutes after booting.

Xorg freezes with Xid error message 1-3 minutes after booting.
If I don’t start the X server, I get “gpu has fallen off the bus” instead.
This does not happen with 375.26.
Using GTX 860M with PRIME Sync enabled.

nvidia-bug-report.sh after the problem:

[url]MEGA

I’m having exactly the same issue with 378.13, no such problem with 375.26.

To add more details: with KMS enabled I get “GPU has fallen off the bus” immediately after starting Gnome 3. Without KMS enabled I can use it for several minutes, after which everything freezes and I can’t even switch to a VT. Sometimes graphical artefacts appear before the freeze.

Here are some relevant parts of the log from when the latter happens:

kernel: NVRM: GPU at PCI:0000:01:00: GPU-f26e0082-2a73-cae2-562c-7fe2cff70d1b
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000008 intr 00040000
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000008 intr 00004000
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000000b intr 80060000
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000000b intr 00024000
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000000b intr 00040000
kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 0000000b intr 00040000
kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000b, Class 0000902d, Offset 0000028c, Data 20040004, ErrorCode 00000004
kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000000
kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
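For anyone comparing logs across driver versions, here is a rough sketch of how the Xid lines above could be pulled out of a kernel log and tallied by code. The function names and the regular expression are my own (hypothetical, not part of any NVIDIA tool), and the pattern only assumes the line layout seen in the excerpt above.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this thread:
#   kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000008 intr 00040000
XID_RE = re.compile(r"Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),\s*(.*)")

# A few lines copied from the log above, plus one non-Xid line.
SAMPLE_LOG = [
    "kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000008 intr 00040000",
    "kernel: NVRM: Xid (PCI:0000:01:00): 69, Class Error: ChId 000b, "
    "Class 0000902d, Offset 0000028c, Data 20040004, ErrorCode 00000004",
    "kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000000",
    "kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic "
    "or interrupt context",
]


def parse_xid_lines(lines):
    """Return (pci_address, xid_code, detail) for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pci, code, detail = match.groups()
            events.append((pci, int(code), detail.strip()))
    return events


def count_xids(lines):
    """Count how often each Xid code occurs."""
    return Counter(code for _, code, _ in parse_xid_lines(lines))


if __name__ == "__main__":
    for code, n in sorted(count_xids(SAMPLE_LOG).items()):
        print(f"Xid {code}: {n} occurrence(s)")
```

Feeding it the output of `journalctl -k` or `dmesg` line by line should work the same way, as long as the lines keep this format.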

This is with GTX 860M 2GB, OS is Fedora 25, Xorg 1.19.1, Gnome 3.22.2, kernel 4.9.9.

So far I have tried reinstalling the drivers from scratch and disabling threaded optimizations globally, to no avail.

Although not exactly the same, it might be related.

I can use X for several minutes regardless of KMS.
I get “GPU has fallen off the bus” if I do not start the X server. With X running, however, I get the following error instead, which shows up when I press the power button:

Xid (PCI:0000:01:00): 62, 1483(18f0) 00000000 00000000

I should probably add that I use Plasma 5.9.2, Xorg 1.19.1, Linux 4.9.9 on Arch Linux.

Always consult the XID Errors page of the GPU Deployment and Management Documentation.

A lot of food for thought there.

Errors 8 and 32 might both be thermal related.

Error 62 is likely hardware related.

I’ve already looked at that. I don’t think it is hardware related, since all other drivers work. I changed the driver and got that error. I’m currently using 375.26 with no problems. The Xid error manual also says that error 62 can be driver related.

Thanks for the reference but I doubt this is a hardware issue since everything is fine with the previous driver version (375.26).

So I just ran several more experiments. Most of the time the freeze locks up the system for good; I can’t even ssh into it. Once, though, I managed to capture some interesting output in the logs: an additional timeout error in the Intel wifi driver (iwlwifi), possibly caused by a CPU lock-up. It looks like the NVIDIA driver is locking up the CPU, which eventually leads to a complete system freeze.

Here’s the nvidia-bug-report.log.gz after rebooting from a freeze, just before another one:

Another experiment: switching PowerMizer from “Auto” to “Prefer Maximum Performance” apparently works around this issue. At least it has been fine for 15+ minutes so far. This is without KMS enabled for me. Will report if it eventually locks up again.

So this indeed helps in my case, replicated several times: no issues when PowerMizer is in maximum performance mode. Power management troubles, maybe powering down the discrete GPU when it shouldn’t?

At the same time, with KMS enabled the driver still drops dead (“GPU has fallen off the bus”) immediately on X start, so it’s not really a solution.

Thank you very much for your testing!
Setting PowerMizer to “High Performance” solves the problem for me, even with KMS enabled. I would like to use power management though…
Have you tried starting with a different DE? I guess I should try GNOME and see if I get the “GPU has fallen off the bus” error.

EDIT: Nope, it does not seem to be DE related, GNOME works as well with the power management tweak.
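In case it helps someone script the PowerMizer workaround discussed above (for example from a session autostart entry), here is a minimal sketch. It assumes that the `nvidia-settings` attribute is called `GPUPowerMizerMode` and that the value 1 corresponds to “Prefer Maximum Performance”; check both against your driver version before relying on it. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumption: 1 selects "Prefer Maximum Performance" for GPUPowerMizerMode.
PREFER_MAX_PERFORMANCE = 1


def powermizer_command(gpu_index=0, mode=PREFER_MAX_PERFORMANCE):
    """Build the nvidia-settings argument list without running it."""
    assignment = f"[gpu:{gpu_index}]/GPUPowerMizerMode={mode}"
    return ["nvidia-settings", "-a", assignment]


def apply_powermizer(gpu_index=0, mode=PREFER_MAX_PERFORMANCE):
    """Run nvidia-settings if it is installed; return True on success."""
    if shutil.which("nvidia-settings") is None:
        return False  # tool not found on PATH, nothing was changed
    result = subprocess.run(powermizer_command(gpu_index, mode), check=False)
    return result.returncode == 0
```

Calling `apply_powermizer()` needs a running X session, and the setting does not persist across X restarts, so it has to be reapplied at each login. As noted above, this only sidesteps the problem at the cost of power management.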