[SOLVED] (367 - 378.13) + 980m + Ubuntu 16.10 = ERROR: GPU:0: Idling display engine timed out

Booting with any driver newer than 364.19 on my 980m-powered Clevo P650-RG results in nothing but a cursor in the top left of the screen once nvidia-modeset allocates the GPU.

Driver version 364.19 works fine, but it does not compile under newer kernels. Ubuntu 16.10 is now using 4.8, so this is a bit of a problem for me.

The following is seen in the kernel log:

Sep 25 13:57:32 sager kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 247
Sep 25 13:57:32 sager kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  370.28  Thu Sep  1 19:45:04 PDT 2016
Sep 25 13:57:32 sager kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  370.28  Thu Sep  1 19:18:48 PDT 2016
Sep 25 13:57:32 sager kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Sep 25 13:57:36 sager kernel: nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
Sep 25 13:58:03 sager kernel: nvidia-modeset: Allocated GPU:0 (GPU-e2f980da-ea7e-4335-6ae3-41ae731aed6d) @ PCI:0000:01:00.0
Sep 25 13:58:03 sager kernel: NVRM: GPU at PCI:0000:01:00: GPU-e2f980da-ea7e-4335-6ae3-41ae731aed6d
Sep 25 13:58:03 sager kernel: NVRM: Xid (PCI:0000:01:00): 61, 13ee(3360) 00000000 00000000
Sep 25 13:58:03 sager nvidia-persistenced[7440]: Verbose syslog connection opened
Sep 25 13:58:03 sager nvidia-persistenced[7440]: Now running with user ID 123 and group ID 131
Sep 25 13:58:03 sager nvidia-persistenced[7440]: Started (7440)
Sep 25 13:58:03 sager nvidia-persistenced[7440]: device 0000:01:00.0 - registered
Sep 25 13:58:03 sager nvidia-persistenced[7440]: Local RPC service initialized
Sep 25 13:58:06 sager kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification; continuing.
Sep 25 13:58:08 sager kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:0x00000040
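
For anyone comparing logs between driver versions, the interesting lines can be pulled out with a short script. This is only a rough sketch: the name `find_nvidia_errors` and both patterns are assumptions modeled on the log lines quoted above, not an official description of the driver's log format.

```python
import re

# Patterns modeled on the lines quoted in this thread (an assumption,
# not an official specification of the NVIDIA driver's log format).
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),")
MODESET_RE = re.compile(r"nvidia-modeset: (?P<level>WARNING|ERROR): (?P<msg>.*)$")


def find_nvidia_errors(lines):
    """Return (kind, detail) tuples for Xid events and nvidia-modeset warnings/errors."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        match = XID_RE.search(line)
        if match:
            hits.append(("xid", f"{match.group('xid')} on {match.group('pci')}"))
            continue
        match = MODESET_RE.search(line)
        if match:
            hits.append((match.group("level").lower(), match.group("msg")))
    return hits


# Three lines copied from the log above.
sample = [
    "Sep 25 13:58:03 sager kernel: NVRM: Xid (PCI:0000:01:00): 61, 13ee(3360) 00000000 00000000",
    "Sep 25 13:58:06 sager kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification; continuing.",
    "Sep 25 13:58:08 sager kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:0x00000040",
]
for kind, detail in find_nvidia_errors(sample):
    print(f"{kind}: {detail}")
```

Informational lines such as "Allocated GPU:0" are skipped on purpose, so the output stays short enough to diff between a working and a failing boot.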

This is somewhat related to https://devtalk.nvidia.com/default/topic/937319/linux/367-370-xx-980m-w-4k-screen-lock-up-at-boot-ubuntu-16-10-/, but I’m starting a new topic as the symptoms have changed when using 370.28. The latest driver no longer triggers the original hard-lock issue I experienced with all driver versions between 364.19 and 370.28, so after four months I was finally able to create the bug report (attached).
nvidia-bug-report.log.gz (213 KB)

I tested with the new 367.57 drivers, but the problem persists:

Oct 12 08:01:22 sager kernel: [    1.549905] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
Oct 12 08:01:22 sager kernel: [    1.574449] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  367.57  Mon Oct  3 20:32:57 PDT 2016
Oct 12 08:01:22 sager kernel: [    1.577443] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Oct 12 08:01:22 sager kernel: [   41.810294] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 244
Oct 12 08:01:29 sager kernel: [   67.048166] nvidia-modeset: Allocated GPU:0 (GPU-e2f980da-ea7e-4335-6ae3-41ae731aed6d) @ PCI:0000:01:00.0
Oct 12 08:01:34 sager nvidia-persistenced: Verbose syslog connection opened
Oct 12 08:01:34 sager nvidia-persistenced: Now running with user ID 123 and group ID 131
Oct 12 08:01:34 sager nvidia-persistenced: Started (9445)
Oct 12 08:01:34 sager nvidia-persistenced: device 0000:01:00.0 - registered
Oct 12 08:01:34 sager nvidia-persistenced: Local RPC service initialized
Oct 12 08:01:36 sager kernel: [   74.131250] nvidia-modeset: WARNING: GPU:0: Lost display notification; continuing.
Oct 12 08:01:38 sager kernel: [   76.902229] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000957d:0:0:0x00000040

I reverted to 364.19, which continues to work fine on the same system.

Is this identified as an issue in the internal bug tracker? I’ve spent a lot of time testing each new version (multiple times each) for the last 5 months and posting results in these threads, yet I’ve not seen any feedback acknowledging this issue or asking for any specific additional information.

There’s a similar-sounding Xid 61 bug, number 200228928.

Does this problem still occur if you disable persistence mode?

Thank you for taking the time to look into this.

Does this problem still occur if you disable persistence mode?

Persistence mode appears to be disabled by default. I’ve booted into recovery mode and tried resuming the boot with persistence mode both disabled and enabled, using the commands from the “Driver Persistence” page of NVIDIA’s GPU Deployment and Management Documentation.

The only difference I can see is that when persistence mode was disabled I would see a single underline cursor in the top right after it loaded. With persistence mode enabled, the screen was completely black after it loaded.

Problem still persists with 375.20 drivers. Attaching bug report…
Reverting to 364.19 works fine.
nvidia-bug-report.log.gz (213 KB)

Issue is still present when using the 375.26 drivers. Had to revert back to 364.19 again to make video usable.

Attaching new bug report:
nvidia-bug-report.log.gz (214 KB)

Same here with 375.26, and it happens randomly when closing OpenGL games.

Same problem when testing with 378.09 drivers. :\ This is the 8th month since a driver was released that actually works with my laptop.

Reverting to 364.19, as usual, makes everything work again.

[attaching bug report]
nvidia-bug-report.log.gz (212 KB)

Did you ever try disconnecting the external monitor and boot? This looks strange: Virtual screen size determined to be 6400 x 2160
Maybe run nvidia-bug-report.sh while the working driver is installed, to have a baseline to compare against?

Did you ever try disconnecting the external monitor and boot?

Many times. Same symptoms.

This looks strange: Virtual screen size determined to be 6400 x 2160

It is correct: 2560x1080 external monitor + 3840x2160 laptop screen (g-sync).
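
To spell out the arithmetic: the virtual screen is the bounding box around all outputs, so with the two screens side by side the widths add up and the height is the taller of the two. A small sketch, where the function name `virtual_screen_size` and the exact layout (laptop panel on the left) are assumptions for illustration only:

```python
def virtual_screen_size(monitors):
    """Bounding box of monitors given as (x_offset, y_offset, width, height)."""
    width = max(x + w for x, y, w, h in monitors)
    height = max(y + h for x, y, w, h in monitors)
    return width, height


# Assumed layout: 3840x2160 laptop panel at the origin,
# 2560x1080 external monitor directly to its right.
layout = [(0, 0, 3840, 2160), (3840, 0, 2560, 1080)]
print(virtual_screen_size(layout))  # (6400, 2160)
```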

In 378.09, I am seeing
nvidia-modeset: WARNING: GPU:0: Lost display notification; continuing.
ERROR: GPU:0: Idling display engine timed out
after resuming from hibernate.
The X server freezes, and I have to ssh in to kill it and reboot.

I upgraded to a Core i5 Skylake CPU and 8GB of RAM. (The next step in the future would be a Core i7 Kaby Lake.)
The random “ERROR: GPU:0: Idling display engine timed out” disappeared.
Incidentally, a “hotplug” warning on hibernate disappeared as well. I’m not sure they are related, but even with heavy RAM usage my system is very robust right now.

My laptop is an i7-6700HQ with 32GB of RAM… so not really in need of an upgrade.

This morning I tested with 378.13 and the 4.4 and 4.10 kernels, same black screen with cursor issue as always. Backleveling to 364.19 or earlier fixed the problem, as usual.

Attaching bug report…
nvidia-bug-report.log.gz (213 KB)

Might sound stupid but did you ever try adding nvidia-drm.modeset=1 to kernel commandline?

Yes, I’ve spent hours trying all kinds of combinations to see if any would make it go further, such as:

nomodeset nvidia.modeset=0 i915.modeset=0 nouveau.modeset=0

and…

nomodeset acpi_osi="Linux"

and…

nvidia-drm.modeset=1 vga=0 rdblacklist=nouveau nouveau.modeset=0

etc…
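
When cycling through combinations like these, it helps to confirm what the running kernel actually received by reading /proc/cmdline. A minimal sketch, assuming nothing beyond the standard library; `parse_cmdline` and `read_cmdline` are hypothetical helper names. Dashes are folded to underscores in parameter names because the kernel treats the two as equivalent there.

```python
import shlex
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        # Normalize names only: nvidia-drm.modeset and nvidia_drm.modeset are the same.
        params[name.replace("-", "_")] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    return parse_cmdline(Path(path).read_text())


# One of the combinations listed above.
params = parse_cmdline("nvidia-drm.modeset=1 vga=0 rdblacklist=nouveau nouveau.modeset=0")
print(params.get("nvidia_drm.modeset"))  # 1
print("nomodeset" in params)             # False
```

On the affected machine, `read_cmdline()` shows whether an edit to the boot entry really took effect for the current boot.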

Ok, let me sum up what I’m getting from your description and your logs:
-You have a Clevo 650 Optimus laptop with 980m dGPU
-The iGPU is disabled in bios, so no Optimus
-You have one internal HiDPI display and one external HiDPI display connected
-You don’t have an xorg.conf, so xserver runs autoconfigured (xorg.conf not in logs)
-Xserver detects the right resolution 6400 x 2160 but driver bails out on setting modes
-From logs: xrandr says you have a 1280x800 internal display connected<-wrong
Is this correct?
Two things to try:
-Generate an xorg.conf, generate modelines for some standard resolutions and force them in xorg.conf
-Activate hybrid mode in bios and try to setup prime

-The iGPU is disabled in bios, so no Optimus

No. I have had prime set up and been using it successfully since well before the 364 drivers.

05:31:53 evil@sager ~/src/imp» prime-select query
nvidia

-You have one internal HiDPI display and one external HiDPI display connected

The external display isn’t HiDPI - it’s a double-wide 2560x1080 screen. I get the same results with this screen disconnected entirely.

From logs: xrandr says you have a 1280x800 internal display connected<-wrong

You are correct that this is wrong. There is no 1280x800 screen. The built-in 3840x2160 g-sync screen doesn’t even show any other hardware resolutions under Linux. The only way to change the resolution is to override it in the nVidia driver, which offers a “1920x1080 (software)” option, and that does not work very well in my experience: sometimes it works, sometimes the Unity launcher freaks out and disappears.

Things to try:

I’ve tried to stay away from overriding xorg.conf like the old days, because I disconnect/connect this system to different monitors every single day. Also, seven or eight months back, I tried to find out the exact make of the internal display and turned up zero as to any specs, supported modes, or even the manufacturer.

Prime works fine in 364.19 and most drivers before (there are a few previous to that version that didn’t work either, then even older ones that do work), so I’ve been keeping that enabled since I use this laptop on battery 3 hours a day.

Sorry, you’re getting something very wrong here. According to your last logs, you’re not using prime. Your iGPU is disabled. Furthermore, you can’t use prime without an xorg.conf.
You can use prime-select as often as you want to switch to intel or nvidia, but as long as the iGPU is disabled in the BIOS it has no effect. Have a look at /var/log/gpu-manager.log; it will probably tell you that there is nothing to switch to.
Edit: When using the Ubuntu nvidia-prime infrastructure, it will generate an xorg.conf for you, but only if the iGPU is enabled.
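
A quick way to see whether the iGPU is visible to the OS at all is to list the PCI devices with a display class in sysfs. This is just a sketch and not part of the nvidia-prime tooling: `list_display_devices` is a made-up helper name. It relies on the standard sysfs `class` and `vendor` files, where PCI class 0x03xxxx means display controller, 0x8086 is Intel and 0x10de is NVIDIA.

```python
from pathlib import Path

# Well-known PCI vendor IDs; anything else is reported as the raw ID.
VENDORS = {"0x8086": "Intel", "0x10de": "NVIDIA"}


def list_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, vendor) for each PCI device with display class 0x03xxxx."""
    found = []
    for dev in sorted(Path(sysfs_root).glob("*")):
        try:
            pci_class = (dev / "class").read_text().strip()
            vendor = (dev / "vendor").read_text().strip()
        except OSError:
            continue  # entry without readable class/vendor files
        if pci_class.startswith("0x03"):
            found.append((dev.name, VENDORS.get(vendor, vendor)))
    return found


for address, vendor in list_display_devices():
    print(address, vendor)
```

If only the NVIDIA device is listed, the iGPU is hidden from the OS and there is nothing for prime-select to switch to.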

You, sir, have just made a friend.

I did forget that I had disabled the iGPU some time back as a simpler workaround to the nvidia-367 bug where it would pick the intel driver instead of the modesetting driver when using prime. Since I constantly had to revert to 367, it seemed like an expedient way to make it work and take the iGPU and Prime out of the equation when troubleshooting.

Now, I wouldn’t expect that re-enabling the iGPU would help with a problem that seems specific to nVidia driver versions post 367… but I’ll be damned if the 378 driver does not now work! I know I had tested previous versions with the iGPU re-enabled, but probably not the last two or three versions.

Today you are my ***-****ed hero. :)