[Strange partial workaround] nvidia-modeset crash on changing virtual terminal

Hello,

I recently upgraded my NVIDIA driver to 390.48 and I get anomalous behaviour when trying to suspend the machine.

The screen goes black (the monitor reports no signal) but the machine does not actually go into suspend mode; it is also impossible to get it back to normal operation (Alt+SysRq REISUB works to reboot).

Inspection of journalctl shows

nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
kernel: IP: [<ffffffffc20619ae>] _nv002366kms+0x5e/0x100 [nvidia_modeset]
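
To pull just these lines out of a long journal, a filter along the following lines may help. It is only a sketch: the function name nv_modeset_errors is made up for illustration, and the patterns are simply taken from the messages quoted above, so other driver versions may word things differently.

```shell
# Hypothetical helper: keep only the journal lines relevant to this crash.
# Reads journal text on stdin, so it works on live output or a saved log.
nv_modeset_errors() {
    grep -E 'nvidia-modeset: ERROR|NULL pointer dereference|\[nvidia_modeset\]'
}

# Example use on the kernel messages of the previous boot:
#   journalctl -k -b -1 --no-pager | nv_modeset_errors
# Or on a saved file:
#   nv_modeset_errors < journalctl.log
```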

Attached are the full journalctl from previous boot (the one when I tried to suspend) as well as the result of nvidia-bug-report.sh.

nvidia-bug-report.log.gz (99.8 KB)
journalctl.log (344 KB)

Editing the title as this happens on computer shutdown too: the session closes and the screen gets no signal, but shutdown does not proceed until SysRq-REISUO.

In fact, locking the session (which involves a change of virtual terminal) triggers the bug.

SSH’ing into the machine allowed me to attach the result of nvidia-bug-report.sh after the bug occurred.

Also, rebooting the system via ssh does not work: doing so closes the ssh session but once again does not proceed to system shutdown.
nvidia-bug-report.log.gz (99.5 KB)

In fact this happens even when just changing the virtual terminal manually (Ctrl+Alt+F1 for example).

Did you check with R396 driver? Does it also happen if you disable the intel gpu in bios?

Hello and thanks for your suggestions.

I tried today with 396.18 and I get the same problem and similar error messages. Attached are the logs with this version of the driver.

The BIOS configuration utility lets me choose the primary GPU (auto/Intel/Nvidia ; the last is selected) but does not seem to offer an option to completely switch off one of them.
journal.log (4.36 KB)
nvidia-bug-report.log.gz (99.3 KB)

Two things you could try as a workaround (mutually exclusive, only one at a time)

  • either use kernel parameter
nomodeset

to disable the iGPU kernel driver

  • or enable nvidia drm kms, in case of Ubuntu, you have to add a file in /etc/modprobe.d/ containing
options nvidia_XXX_drm modeset=1

with XXX being the major version of the installed nvidia driver.
After that, run

sudo update-initramfs -u

and reboot.

sudo cat /sys/module/nvidia_drm/parameters/modeset

should return ‘Y’ if done right.
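
The check can also be wrapped in a small function, sketched here. The name nvidia_kms_status is hypothetical, and the optional path argument exists only so the function can be tried on a machine without the driver loaded; on a real system the sysfs file may be readable by root only, hence the sudo in the command above.

```shell
# Hypothetical helper: report whether nvidia-drm KMS is enabled.
# Defaults to the sysfs file quoted above; pass another path to test it.
nvidia_kms_status() {
    param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param_file" ]; then
        # Either nvidia_drm is not loaded or we lack permission to read it.
        echo "unknown"
        return 2
    fi
    case "$(cat "$param_file")" in
        Y) echo "enabled" ;;
        N) echo "disabled" ;;
        *) echo "unexpected value"; return 1 ;;
    esac
}
```

Run as root, nvidia_kms_status with no argument reads the real parameter file.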

To enable nvidia drm kms, I had to edit /etc/modprobe.d/nvidia-graphics-drivers.conf as this file already contains an option to disable drm kms. Alas, it is impossible to get to lightdm with this enabled : in normal mode, the driver crashes and the screen gets no signal ; in recovery mode, the recovery menu is unresponsive and the screen scintillates. I had to use an ssh connection to disable it again.

The nomodeset kernel parameter does not improve things: the same problem happens. But the point where the driver crashes is not the same:

kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
avril 27 00:08:27 yann-Precision-Tower-3620 systemd[1]: Stopped Session c2 of user yann.
kernel NULL pointer dereference at 00000000000000c0
kernel: IP: [<ffffffffc1f2f56e>] _nv002017kms+0xe/0x110 [nvidia_modeset]
kernel: PGD 0 
kernel: Oops: 0000 [#1] SMP 
kernel: Modules linked in: pci_stub vboxpci(OE) vboxdrv(OE) rfcomm bnep binfmt_misc joydev input_leds btusb btrtl btbcm dell_wmi hid_generic btintel sparse_keymap bluetooth dcdbas uas usb_storage dell_smm_hwmon intel_rapl snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic snd_hda_codec_hdmi intel_powerclamp coretemp usbhid hid nvidia_uvm(POE) kvm_intel snd_hda_intel kvm snd_hda_codec irqbypass snd_hda_core snd_hwdep crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_pcm shpchp snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer aes_x86_64 snd lrw gf128mul glue_helper serio_raw mei_me ablk_helper soundcore cryptd mei ie31200_edac edac_core 8250_fintek mac_hid acpi_pad wmi parport_pc ppdev lp parport autofs4 i915_bpo nvidia_drm(POE) nvidia_modeset(POE)
kernel:  intel_ips i2c_algo_bit drm_kms_helper syscopyarea e1000e sysfillrect sysimgblt fb_sys_fops nvidia(POE) ptp psmouse ahci pps_core drm libahci ipmi_msghandler video fjes [last unloaded: vboxnetflt]
kernel: CPU: 5 PID: 1227 Comm: Xorg Tainted: P           OE   4.4.0-121-generic #145-Ubuntu
kernel: Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 2.7.3 01/31/2018
kernel: task: ffff880411a99e00 ti: ffff880416724000 task.ti: ffff880416724000
kernel: RIP: 0010:[<ffffffffc1f2f56e>]  [<ffffffffc1f2f56e>] _nv002017kms+0xe/0x110 [nvidia_modeset]
kernel: RSP: 0018:ffff880416727ac8  EFLAGS: 00010206
kernel: RAX: 0000000000000000 RBX: 0000000000000018 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
kernel: RBP: 0000000000000000 R08: ffff880414bc5408 R09: 0000000000000020
kernel: R10: 0000000000000000 R11: ffff880410eb7e10 R12: ffff880416c11008
kernel: R13: ffff88041a34d808 R14: ffff880416727b8c R15: ffff880410eb7f08
kernel: FS:  00007f4127885a00(0000) GS:ffff88042dd40000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00000000000000c0 CR3: 0000000002e0a000 CR4: 0000000000360670
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Stack:
kernel:  ffff880416727b90 ffffffff811f26eb ee4e4c0706729a07 ffff88041a34d808
kernel:  0000000000000000 ffffffffc1f43dcd ffff880415dc10c0 0000000000000001
kernel:  ffff880416c11008 ffff88041a34d808 ffff880416a99d88 ffff880416c11008
kernel: Call Trace:
kernel:  [<ffffffff811f26eb>] ? __slab_free+0xcb/0x2c0
kernel:  [<ffffffffc1f43dcd>] ? _nv002291kms+0x1d/0xe0 [nvidia_modeset]
kernel:  [<ffffffffc1f492c0>] ? _nv002060kms+0xa0/0x100 [nvidia_modeset]
kernel:  [<ffffffffc1f0b739>] ? _nv000232kms+0x109/0x140 [nvidia_modeset]
kernel:  [<ffffffffc1f0b7fe>] ? nvKmsClose+0x8e/0x170 [nvidia_modeset]
kernel:  [<ffffffffc1f08cb1>] ? nvkms_close_common+0x21/0x60 [nvidia_modeset]
kernel:  [<ffffffffc1f08d0a>] ? nvkms_close+0x1a/0x30 [nvidia_modeset]
kernel:  [<ffffffffc00fc39f>] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
kernel:  [<ffffffff812159d7>] ? __fput+0xe7/0x230
kernel:  [<ffffffff81215b5e>] ? ____fput+0xe/0x10
kernel:  [<ffffffff810a16e6>] ? task_work_run+0x86/0xb0
kernel:  [<ffffffff81085f01>] ? do_exit+0x2e1/0xb00
kernel:  [<ffffffff8105b7b3>] ? x2apic_send_IPI_mask+0x13/0x20
kernel:  [<ffffffff81051943>] ? native_smp_send_reschedule+0x53/0x70
kernel:  [<ffffffff810867a3>] ? do_group_exit+0x43/0xb0
kernel:  [<ffffffff81093004>] ? get_signal+0x294/0x600
kernel:  [<ffffffff8102e577>] ? do_signal+0x37/0x6f0
kernel:  [<ffffffff810cd142>] ? up+0x32/0x50
kernel:  [<ffffffffc1f08d70>] ? nvkms_ioctl_common+0x50/0x80 [nvidia_modeset]
kernel:  [<ffffffffc1f08e11>] ? nvkms_ioctl+0x71/0xa0 [nvidia_modeset]
kernel:  [<ffffffffc00fc082>] ? nvidia_frontend_compat_ioctl+0x42/0x50 [nvidia]
kernel:  [<ffffffffc00fc09e>] ? nvidia_frontend_unlocked_ioctl+0xe/0x10 [nvidia]
kernel:  [<ffffffff81227a6f>] ? do_vfs_ioctl+0x2af/0x4b0
kernel:  [<ffffffff810034fc>] ? exit_to_usermode_loop+0x8c/0xd0
kernel:  [<ffffffff81003c7e>] ? syscall_return_slowpath+0x4e/0x60
kernel:  [<ffffffff8184f930>] ? int_ret_from_sys_call+0x25/0x9f
kernel: Code: 00 00 03 87 2c 05 00 00 44 39 c0 77 c6 f3 c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 53 48 89 fb 48 83 ec 18 48 85 ff 74 45 <48> 8b 87 a8 00 00 00 48 8b 8f a0 00 00 00 8b 97 bc 00 00 00 89 
kernel: RIP 
kernel:  [<ffffffffc1f2f56e>] _nv002017kms+0xe/0x110 [nvidia_modeset]
kernel:  RSP <ffff880416727ac8>
kernel: CR2: 00000000000000c0
kernel: ---[ end trace 4c8460f20b5bb8a8 ]---
kernel: Fixing recursive fault but reboot is needed!

Problem still happens with latest kernel 4.4.0-122-generic

I guess that choosing Nvidia put you into nvidia-only mode.
I use Mint 18.3 but I have activated the HWE stack, which is recommended for desktop users. This will update your kernel and your Xorg stack. If you are on 4.4, I wonder how old your Xorg is?

You can do it with this sort of install:

sudo apt-get install  linux-generic-hwe-16.04-edge linux-headers-generic-hwe-16.04-edge linux-image-generic-hwe-16.04-edge linux-tools-generic-hwe-16.04-edge xserver-xorg-core-hwe-16.04

It is well documented for Ubuntu and, as you can see, it is in the standard repositories.

For me, on two optimus laptops, Mint works flawlessly.

By the way, I suggest you make a zz-nvidia.conf file in /etc/modprobe.d and activate modeset there.
The one you have used will be overwritten on upgrades.
zz ensures it is loaded last, therefore overriding the contents of the installed file.
I have this:

options nvidia_384_drm modeset=1
options nvidia_396_drm modeset=1
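
To see which modeset lines are present and in what order the files sort, something like the sketch below could be used. The function name list_modeset_options is invented for illustration, and the sketch only lists the lines in sorted file order; it relies on the explanation above that the file sorting last takes precedence and does not verify that itself. Running modprobe -c shows the configuration modprobe actually ends up with.

```shell
# Hypothetical helper: print every "options ... modeset=" line found in the
# *.conf files of a modprobe.d directory, in the sorted order of the glob.
list_modeset_options() {
    dir="${1:-/etc/modprobe.d}"
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        # -H prefixes each match with its file name; no match is not an error.
        grep -H -E '^[[:space:]]*options[[:space:]].*modeset=' "$f" || true
    done
}
```

With both nvidia-graphics-drivers.conf and zz-nvidia.conf in place, the zz-nvidia.conf line should be printed last.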

Thanks for your suggestion ; I did not know about HWE.

Alas, the same problems still occur.
nvidia-bug-report.log.gz (89 KB)
journalctl.log (169 KB)

Still trying to make this work, now with Xubuntu 18.04 (kernel 4.15.0-23-generic, Xorg 1.19.6) and driver 390.67.

The Nvidia DRM KMS still does not work.

Operation without it still exhibits the described problem.

However there is a strange partial workaround.

While fiddling with grub configuration, I made a typo at some point that caused it to complain about my GRUB_GFXPAYLOAD_LINUX and booting in “blind mode”.

And then suspending works ! So now I set GRUB_GFXPAYLOAD_LINUX=blind, which comically is not a valid choice for grub but causes it to boot in blind mode anyway as a result of this invalidity.
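
For anyone wanting to check what their own system currently uses, here is a rough sketch. It assumes the variable lives in /etc/default/grub, the usual place on Ubuntu-based systems; the function name grub_gfxpayload is hypothetical.

```shell
# Hypothetical helper: print the value of GRUB_GFXPAYLOAD_LINUX from a grub
# defaults file (last uncommented assignment wins), with quotes stripped.
grub_gfxpayload() {
    cfg="${1:-/etc/default/grub}"
    sed -n 's/^GRUB_GFXPAYLOAD_LINUX=//p' "$cfg" | tail -n 1 | tr -d "\"'"
}

# To try the workaround described above (on Ubuntu-style setups):
#   1. set GRUB_GFXPAYLOAD_LINUX=blind in /etc/default/grub
#   2. run: sudo update-grub
#   3. reboot
```

An empty result means the variable is not set in that file.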

Hibernation seems to more or less work (image is written to swap and computer turns off), but not resuming from it (it just boots normally, discarding the written image). And I cannot efficiently debug this because of the “blind mode” boot.
nvidia-bug-report.log.gz (104 KB)
journalctl.log (176 KB)

And some news : the Nvidia DRM KMS works with 396.24.02 if, and only if, grub boots with “blind mode”.

So what the hell is grub doing when not in blind mode that causes havoc ?

This is to inform that these crashes do not happen when Legacy boot options are enabled in the BIOS (this in spite of the fact that my whole system boots in UEFI mode anyway).
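
One common way to confirm how the running system was booted is to look for /sys/firmware/efi, which the kernel exposes only on an EFI boot. A minimal sketch follows; the function name boot_mode is made up, and the path argument exists only so it can be tried against a test directory.

```shell
# Hypothetical helper: guess the boot mode of the running system.
# The kernel creates /sys/firmware/efi only when booted through EFI.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "legacy or unknown"
    fi
}
```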

Hello, my GTX 750 Ti has the same problem. I’m running it on openSUSE 15.0 with the 396.54.05 driver in UEFI mode, and the driver crashes when I log out or reboot the system. How can I solve this problem? Thanks.

As extra information, I can confirm the workaround also for suspend to RAM + GLVND