GTX 580 - 375.10 - Weston/EGLStream produces crash in kernel

Hi,

I’m running Arch Linux with the 4.8.6 kernel and have the weston-eglstream package installed from the AUR.
When I boot to the console by adding “3” to my kernel command line, unload the nvidia-drm module with “# modprobe -r nvidia-drm”, and then reload it with the modeset=1 parameter, I can’t start weston with “weston --use-egldevice”. The screen goes black and weston never shows up. When I ssh into the machine and follow the journal with “journalctl -f”, I see this kernel message:

Nov 02 11:25:57 archlinux kernel: usercopy: kernel memory overwrite attempt detected to ffff8803e8ec7ce0 (<process stack>) (8 bytes)
Nov 02 11:25:57 archlinux kernel: ------------[ cut here ]------------
Nov 02 11:25:57 archlinux kernel: kernel BUG at mm/usercopy.c:75!
Nov 02 11:25:57 archlinux kernel: invalid opcode: 0000 [#2] PREEMPT SMP
Nov 02 11:25:57 archlinux kernel: Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) drm_kms_helper drm syscopyarea sysfillrect sysimgblt fb_sys_fops ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_mod nct6775 hwmon_vid snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic input_leds joydev mousedev hid_roccat_lua hid_roccat_common hid_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp btusb coretemp btrtl btbcm eeepc_wmi iTCO_wdt iTCO_vendor_support btintel kvm_intel asus_wmi sparse_keymap led_class mxm_wmi evdev kvm bluetooth
Nov 02 11:25:57 archlinux kernel:  mac_hid irqbypass snd_hda_intel rfkill crct10dif_pclmul crc32_pclmul snd_hda_codec usbhid crc32c_intel ghash_clmulni_intel snd_hda_core hid snd_hwdep aesni_intel aes_x86_64 lrw gf128mul glue_helper snd_pcm ablk_helper cryptd e1000e snd_timer i2c_i801 intel_cstate snd intel_rapl_perf mei_me psmouse ptp pcspkr i2c_smbus soundcore mei pps_core lpc_ich shpchp fan thermal fjes wmi battery video tpm_tis tpm_tis_core tpm button squashfs sch_fq_codel loop vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) ip_tables x_tables ext4 crc16 jbd2 fscrypto mbcache sr_mod cdrom sd_mod serio_raw atkbd libps2 ahci libahci libata xhci_pci ehci_pci xhci_hcd ehci_hcd scsi_mod usbcore usb_common i8042 serio [last unloaded: nvidia]
Nov 02 11:25:57 archlinux kernel: CPU: 2 PID: 3591 Comm: weston Tainted: P      D    O    4.8.6-1-ARCH #1
Nov 02 11:25:57 archlinux kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
Nov 02 11:25:57 archlinux kernel: task: ffff8803f6249c80 task.stack: ffff8803e8ec4000
Nov 02 11:25:57 archlinux kernel: RIP: 0010:[<ffffffff81205f5f>]  [<ffffffff81205f5f>] __check_object_size+0x13f/0x1d6
Nov 02 11:25:57 archlinux kernel: RSP: 0018:ffff8803e8ec7c88  EFLAGS: 00010282
Nov 02 11:25:57 archlinux kernel: RAX: 0000000000000062 RBX: ffff8803e8ec7ce0 RCX: 0000000000000000
Nov 02 11:25:57 archlinux kernel: RDX: 0000000000000000 RSI: ffff88041ec8dba8 RDI: ffff88041ec8dba8
Nov 02 11:25:57 archlinux kernel: RBP: ffff8803e8ec7ca8 R08: 000000000003d43f R09: 0000000000000005
Nov 02 11:25:57 archlinux kernel: R10: ffff8803e5382a00 R11: 000000000000037a R12: 0000000000000008
Nov 02 11:25:57 archlinux kernel: R13: 0000000000000000 R14: ffff8803e8ec7ce8 R15: ffff8803e5382a00
Nov 02 11:25:57 archlinux kernel: FS:  00007f05e5e0de80(0000) GS:ffff88041ec80000(0000) knlGS:0000000000000000
Nov 02 11:25:57 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 02 11:25:57 archlinux kernel: CR2: 00007f05e51c34a0 CR3: 00000003f75fb000 CR4: 00000000000406e0
Nov 02 11:25:57 archlinux kernel: Stack:
Nov 02 11:25:57 archlinux kernel:  ffff8803e8ec7ce0 0000000000000008 00007ffc5b4ce350 00000000ffffffea
Nov 02 11:25:57 archlinux kernel:  ffff8803e8ec7cd0 ffffffffa14dc4f1 ffff8803e8ec7dd0 0000000000f00000
Nov 02 11:25:57 archlinux kernel:  ffff8803fd3cd240 ffff880404af7f08 ffffffffa15439e9 ffff8803e8ec7d30
Nov 02 11:25:57 archlinux kernel: Call Trace:
Nov 02 11:25:57 archlinux kernel:  [<ffffffffa14dc4f1>] nvkms_copyin+0x21/0x50 [nvidia_modeset]
Nov 02 11:25:57 archlinux kernel:  [<ffffffffa15439e9>] _nv000272kms+0x69/0x120 [nvidia_modeset]
Nov 02 11:25:57 archlinux kernel:  [<ffffffff810d7198>] ? console_unlock+0x318/0x5f0
Nov 02 11:25:57 archlinux kernel:  [<ffffffffa03c07c6>] ? nvidia_drm_gem_import_nvkms_memory+0x76/0x110 [nvidia_drm]
Nov 02 11:25:57 archlinux kernel:  [<ffffffff810bfd3d>] ? remove_wait_queue+0x4d/0x60
Nov 02 11:25:57 archlinux kernel:  [<ffffffffa053ed40>] ? drm_ioctl+0x200/0x4f0 [drm]
Nov 02 11:25:57 archlinux kernel:  [<ffffffff810bfe14>] ? __wake_up+0x44/0x50
Nov 02 11:25:57 archlinux kernel:  [<ffffffffa03c0750>] ? nvidia_drm_dumb_create+0x190/0x190 [nvidia_drm]
Nov 02 11:25:57 archlinux kernel:  [<ffffffff813eb570>] ? n_tty_open+0xd0/0xd0
Nov 02 11:25:57 archlinux kernel:  [<ffffffff812088d7>] ? __vfs_write+0x37/0x140
Nov 02 11:25:57 archlinux kernel:  [<ffffffff8121c433>] ? do_vfs_ioctl+0xa3/0x5f0
Nov 02 11:25:57 archlinux kernel:  [<ffffffff812276a7>] ? __fget+0x77/0xb0
Nov 02 11:25:57 archlinux kernel:  [<ffffffff8121c9f9>] ? SyS_ioctl+0x79/0x90
Nov 02 11:25:57 archlinux kernel:  [<ffffffff815f7cf2>] ? entry_SYSCALL_64_fastpath+0x1a/0xa4
Nov 02 11:25:57 archlinux kernel: Code: 87 71 81 48 0f 45 d0 48 c7 c6 70 a5 72 81 48 c7 c0 eb 43 73 81 48 0f 45 f0 4d 89 e1 48 89 d9 48 c7 c7 28 0d 73 81 e8 a7 01 f7 ff <0f> 0b 48 89 df e8 57 75 e6 ff 84 c0 0f 84 f8 fe ff ff b8 00 00 
Nov 02 11:25:57 archlinux kernel: RIP  [<ffffffff81205f5f>] __check_object_size+0x13f/0x1d6
Nov 02 11:25:57 archlinux kernel:  RSP <ffff8803e8ec7c88>
Nov 02 11:25:57 archlinux kernel: ---[ end trace 5ad7d5aef591d152 ]---

This seems to be a problem specific to that kernel version. I downgraded the driver to 370.28 and the problem is the same. I then installed linux-lts (4.4.28) with the 370.28 kernel module for it, and “weston --use-egldevice” works just fine. However, with 370.28 this bug seems to be back: https://devtalk.nvidia.com/default/topic/932343/364-19-gtx-580-weston-simple-egl-fails-to-initialize-egl/

EDIT: 375.10 also works as long as I use the Linux LTS Kernel so this is further evidence that it’s specific to 4.8.6. weston-simple-egl still doesn’t work.

nvidia-bug-report.log.gz (96.7 KB)

Thanks for the report.

This is specific to the Linux kernel config option CONFIG_HARDENED_USERCOPY (new in kernel 4.8). It attempts to validate that the kernel address passed to copy_from_user() and copy_to_user() is either on the stack or on the heap, in order to catch bugs where other kernel memory is either copied to user space or overwritten from user space.

In the scenario here, memory is safely on the stack, but nvidia-modeset.ko was compiled such that the binary-only part of that kernel module does not contain stack frame information, and CONFIG_HARDENED_USERCOPY cannot recognize that the memory is really on the stack.
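
To make the interaction a bit more concrete, here is a toy model in Python of the kind of check described above. It is only a sketch and not the kernel's actual code: the function name, the return strings, and the frame ranges are all made up for illustration, and the 16 KiB stack size is an assumption. Only the stack base (task.stack) and the object address and size are taken from the log in the report.

```python
def classify_stack_object(obj, length, stack_start, stack_end, frames):
    """Toy model of a hardened-usercopy style stack check.

    frames is a list of (start, end) address ranges that the checker
    was able to identify as stack frames (hypothetical input).
    """
    obj_end = obj + length
    # Object lies entirely outside the stack: not a stack object at all.
    if obj_end <= stack_start or obj >= stack_end:
        return "not_on_stack"
    # Object straddles a stack boundary: always rejected.
    if obj < stack_start or obj_end > stack_end:
        return "rejected"
    # Object must sit fully inside one frame the checker knows about.
    for frame_start, frame_end in frames:
        if frame_start <= obj and obj_end <= frame_end:
            return "ok"
    return "rejected"


# Values from the log: task.stack and the 8-byte object address.
STACK_START = 0xFFFF8803E8EC4000
STACK_END = STACK_START + 0x4000  # assumes a 16 KiB kernel stack
OBJ = 0xFFFF8803E8EC7CE0

# Hypothetical frame lists: one where the frame holding OBJ is not
# visible to the checker, one where it is.
frames_missing = [(0xFFFF8803E8EC7D40, 0xFFFF8803E8EC7E00)]
frames_complete = frames_missing + [(0xFFFF8803E8EC7CB8, 0xFFFF8803E8EC7D30)]

print(classify_stack_object(OBJ, 8, STACK_START, STACK_END, frames_missing))
print(classify_stack_object(OBJ, 8, STACK_START, STACK_END, frames_complete))
```

In this model the same address is rejected when the frame containing it is not visible to the checker, and accepted once that frame is known. A heap address falls outside the stack range entirely, so it never reaches the frame check.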

In a future release, we’ll allocate this memory on the heap to avoid this interaction problem. In the meantime, the best workaround I can suggest is to rebuild your >= 4.8 kernel without CONFIG_HARDENED_USERCOPY. However, CONFIG_HARDENED_USERCOPY provides very useful checking, so I’d encourage you to go back to a CONFIG_HARDENED_USERCOPY-enabled kernel once an updated NVIDIA driver is available.
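
If it helps to confirm whether a given kernel was built with this option, a small Python sketch along these lines can check. It assumes the kernel config is exposed either at /proc/config.gz (which requires CONFIG_IKCONFIG_PROC) or at /boot/config-<release>; neither is guaranteed on every distribution, and the helper names are made up for this example.

```python
import gzip
import os
import platform


def option_value(config_text, option):
    """Return the option's value (e.g. 'y') or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


def read_kernel_config():
    """Return the running kernel's config text, or None if not exposed."""
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not found in the assumed locations")
    elif option_value(text, "CONFIG_HARDENED_USERCOPY") == "y":
        print("CONFIG_HARDENED_USERCOPY is enabled")
    else:
        print("CONFIG_HARDENED_USERCOPY is not enabled")
```

The parsing is kept separate from the file lookup so it can be pointed at any saved .config file, for example the one used for a kernel rebuild.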

Sorry for the trouble.