System seems locked while rebooting with Linux 5.2.1 and nvidia drivers 430.34 or 430.26

GoofyX · July 18, 2019, 7:15am

After upgrading my kernel to 5.2.1, I have problems shutting down or rebooting the system. The distro is Gentoo, with 64-bit Linux 5.2.1 and systemd-241-r4, and the card is a 1050 Ti with 4GB of VRAM. I’m using KDE Plasma with sddm as the graphical login manager. The system works fine inside Plasma, but when I click on reboot or shut down, X is killed and the initial virtual console 1 is displayed (it shows the systemd messages from booting), but the screen is frozen. I cannot switch with Ctrl+Alt+F2 (or to any other console) and Ctrl+Alt+Del does nothing. The system is not locked up, since I can ping it from another PC. It might shut down eventually, but it will take several minutes. I had to use the SysRq key to reboot the system a couple of times.

The previous 5.1.16 kernel worked fine with the same drivers (tried both 430.26 and 430.34). It seems that the nvidia driver is the source of the problem, since I logged in on a virtual console, killed X, unloaded all nvidia(_drm, _modeset) modules and issued a shutdown command, which worked instantly.
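
A rough sketch of that manual workaround, for anyone who wants to reproduce the test. It assumes that stopping the sddm service is what kills X, and that only the nvidia_drm, nvidia_modeset and nvidia modules are loaded (a system with nvidia_uvm loaded would need that removed too). The run helper and the DRY_RUN variable are made up for this example; by default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 and run as root to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

unload_nvidia_and_poweroff() {
    # Assumption: stopping the display manager is how X gets killed here.
    run systemctl stop sddm
    # Dependent modules go first; modprobe -r fails if a module is in use.
    run modprobe -r nvidia_drm nvidia_modeset nvidia
    run systemctl poweroff
}

unload_nvidia_and_poweroff
```

If the shutdown is instant after the modules are gone, that points at the driver teardown rather than at systemd or the login manager.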

I can provide more details if needed or send the output from nvidia-bug-report.

Thanks.

generix · July 18, 2019, 3:35pm

Can you log in over ssh and check which processes hang and create an nvidia-bug-report.sh to upload?
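
One possible way to do the first part over ssh is to list processes stuck in uninterruptible sleep (state D), which is usually what a hang inside a driver looks like. This is only a sketch; the filter_blocked name is made up for illustration.

```shell
# Keep only rows whose STAT column starts with "D" (uninterruptible sleep).
# Expects "PID STAT COMMAND" rows on stdin.
filter_blocked() {
    awk '$2 ~ /^D/'
}

# On the affected machine, over ssh, while the shutdown is stuck:
#   ps -eo pid=,stat=,comm= | filter_blocked
```

Any process that stays in that list for minutes is a candidate for what is holding up the shutdown.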

GoofyX · July 18, 2019, 3:47pm

Where should I upload it? Not here I hope, since the log contains sensitive information.

GoofyX · July 18, 2019, 4:00pm

I’ve sent you a DM with the link to the log file.

GoofyX · July 21, 2019, 7:48pm

Any news on this one…?

birdie · July 22, 2019, 8:40am

Works for me.

Must be specific to your HW configuration.

Running Linux 5.2.2 + NVIDIA 430.26.

@generix doesn’t work for NVIDIA - he’s just an average guy who helps people.

Try rebooting after pressing Ctrl + Alt + F1. If there’s a kernel panic you’ll most likely be able to snap it - then you can upload an image.

There’s a similar thread here https://devtalk.nvidia.com/default/topic/1052114/linux/kernel-panic-during-poweroff-freezes-the-system/ albeit the other person is running Linux 5.1.x.

GoofyX · July 22, 2019, 3:16pm

I somehow thought he works for Nvidia. No luck, I guess. Is any Nvidia developer watching these forum threads…?

The system is not frozen or locked up. If I choose to log out instead of shutting down, you can ssh to it normally. It’s just that the first screen console (the modeset screen) is stuck there and pressing Ctrl+Alt+F does nothing. If I wait long enough (e.g. 3-4 minutes), the system will shut down (or log off) eventually.

I had to revert back to 5.1.18, since this is rather annoying.

Could it be some kernel misconfiguration?

birdie · July 22, 2019, 6:37pm

A kernel “misconfiguration” sounds unlikely.

What it could be:

A kernel regression/bug/new internal behavior/new behavior in regard to your HW configuration - in this case bisect will help, though it’s an arduous task.
Some other package you’ve updated in the meantime you’ve totally forgotten about
Some reconfiguration of your userspace (systemd, login manager, kernel sysctl parameters, etc.)
A compiler issue (this happens quite rarely) or weird compilation flags (e.g. Gentoo users love to over-optimize their kernels by using some experimental compilation flags)
Something else entirely.

Considering you’re the only one with this issue so far, you’re on your own.

I’d recommend using off-the-shelf distros which come with precompiled components: there is far less chance of hitting issues like yours. At least if you hit them, there’ll be other users who could chime in and help debug the problem.

generix · July 24, 2019, 10:01am

Nvidia staff have green posts and an nvidia logo.
The nvidia-bug-report.log upload was already deleted when I noticed it.
Things to do to troubleshoot:

Switch to vt1, then issue systemctl poweroff and watch where it hangs.
Wait until it shuts down, then on reboot run
journalctl -b-1 --no-pager | tail -n20
to get info about what happened last.
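
If the last 20 lines are not enough, a filter along these lines might help narrow down the previous boot's journal. It is only a sketch: the filter_hang_lines name is made up, and the patterns are guesses based on typical kernel hung-task and systemd stop-timeout messages, so they may need adjusting.

```shell
# Keep journal lines that usually accompany a stuck shutdown:
# kernel hung-task warnings, systemd stop timeouts, and forced kills.
filter_hang_lines() {
    grep -E 'blocked for more than [0-9]+ seconds|timed out|with signal SIGKILL'
}

# After the machine has finally rebooted:
#   journalctl -b -1 --no-pager | filter_hang_lines
```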

GoofyX · July 28, 2019, 8:37am

I’m attaching a log here.
log.gz (21.2 KB)

dodo.godlike · July 29, 2019, 5:55pm

Hello everyone.

I have the same problem, my system is Arch Linux with the following packages

linux 5.2.3.arch1-1
nvidia 430.34-3
gdm 3.32.0+2+g820f90f5-1

My system has

Intel I7 9700k
Gigabyte Z390 Aorus Elite
Nvidia 2070 RTX (ROG Strix)

In my case the GDM process is being kept alive by the Xorg session, which in turn is being kept alive by the nvidia driver. I managed to capture a journalctl log with a stack trace from the nvidia driver

Jul 23 22:08:26 tornio kernel: INFO: task Xorg:794 blocked for more than 122 seconds.
Jul 23 22:08:26 tornio kernel:       Tainted: P           OE     5.2.1-arch1-1-ARCH #1
Jul 23 22:08:26 tornio kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 23 22:08:26 tornio kernel: Xorg            D    0   794      1 0x00400084
Jul 23 22:08:26 tornio kernel: Call Trace:
Jul 23 22:08:26 tornio kernel:  ? __schedule+0x27f/0x6d0
Jul 23 22:08:26 tornio kernel:  schedule+0x3d/0xc0
Jul 23 22:08:26 tornio kernel:  schedule_timeout+0x29b/0x3d0
Jul 23 22:08:26 tornio kernel:  wait_for_common+0xeb/0x190
Jul 23 22:08:26 tornio kernel:  ? wake_up_q+0x70/0x70
Jul 23 22:08:26 tornio kernel:  flush_workqueue+0x15a/0x450
Jul 23 22:08:26 tornio kernel:  ? nv_acpi_notify+0x30/0x30 [nvidia]
Jul 23 22:08:26 tornio kernel:  acpi_remove_notify_handler+0x1f5/0x2e2
Jul 23 22:08:26 tornio kernel:  nv_acpi_remove_one_arg+0xfc/0x140 [nvidia]
Jul 23 22:08:26 tornio kernel:  acpi_device_remove+0x5d/0xb0
Jul 23 22:08:26 tornio kernel:  device_release_driver_internal+0xdb/0x1b0
Jul 23 22:08:26 tornio kernel:  driver_detach+0x44/0x7c
Jul 23 22:08:26 tornio kernel:  bus_remove_driver+0x51/0xc4
Jul 23 22:08:26 tornio kernel:  nv_acpi_uninit+0xa7/0xf0 [nvidia]
Jul 23 22:08:26 tornio kernel:  nvidia_close+0x28d/0x2e0 [nvidia]
Jul 23 22:08:26 tornio kernel:  nvidia_frontend_close+0x38/0x60 [nvidia]
Jul 23 22:08:26 tornio kernel:  __fput+0xae/0x210
Jul 23 22:08:26 tornio kernel:  task_work_run+0x93/0xb0
Jul 23 22:08:26 tornio kernel:  exit_to_usermode_loop+0xba/0xc0
Jul 23 22:08:26 tornio kernel:  do_syscall_64+0x168/0x1b0
Jul 23 22:08:26 tornio kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 23 22:08:26 tornio kernel: RIP: 0033:0x7f9286d6b878
Jul 23 22:08:26 tornio kernel: Code: Bad RIP value.
Jul 23 22:08:26 tornio kernel: RSP: 002b:00007ffe01ccc828 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Jul 23 22:08:26 tornio kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f9286d6b878
Jul 23 22:08:26 tornio kernel: RDX: 00007ffe01ccc800 RSI: 0000000000000000 RDI: 000000000000000b
Jul 23 22:08:26 tornio kernel: RBP: 00007f92854b5084 R08: 00007ffe01ccc830 R09: 00007ffe01ccc83c
Jul 23 22:08:26 tornio kernel: R10: fffffffffffffba8 R11: 0000000000000246 R12: 000055e8abb4ca40
Jul 23 22:08:26 tornio kernel: R13: 00007f92854b5088 R14: 00000000c1d00001 R15: 00000000c1d00001
Jul 23 22:08:45 tornio systemd[775]: at-spi-dbus-bus.service: State 'stop-sigterm' timed out. Killing.
Jul 23 22:08:45 tornio systemd[775]: at-spi-dbus-bus.service: Killing process 1254 (at-spi-bus-laun) with signal SIGKILL.
Jul 23 22:08:45 tornio systemd[775]: at-spi-dbus-bus.service: Main process exited, code=killed, status=9/KILL
Jul 23 22:08:45 tornio systemd[775]: at-spi-dbus-bus.service: Failed with result 'timeout'.
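
To pull just the driver frames out of a trace like the one above, something like the following sketch could be used. The module_frames name is made up for illustration, and it assumes the journal's usual "kernel:" prefix on each line.

```shell
# Print only the call-trace frames tagged with the given module name,
# with the timestamp/host/"kernel:" prefix stripped.
module_frames() {
    grep -F "[$1]" | sed -E 's/^.*kernel:[[:space:]]+//'
}

# Kernel messages from the previous boot:
#   journalctl -b -1 -k --no-pager | module_frames nvidia
```

In the trace above this leaves the nv_acpi_* and nvidia_close frames, which is the path through which Xorg ends up waiting in flush_workqueue.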

Also this problem does not occur on Ubuntu 18.04 installed on the same machine (Ubuntu with default kernel and latest 430.34 drivers).
For more information, I made a GDM bug report with further details here https://gitlab.gnome.org/GNOME/gdm/issues/503

GoofyX · July 29, 2019, 6:32pm

Your stack trace reports that you run Linux 5.2.1 though.

I have created a bug with Nvidia and posted the link to this thread here for further information.

dodo.godlike · July 29, 2019, 7:32pm

I have had this problem since kernel 5.1.* (which is the first kernel I tested on this relatively new PC). That log is from kernel 5.2.1, but the problem persists with the latest 5.2.3. As stated in my previous post, this does not happen on Ubuntu 18.04 with the default kernel and latest nvidia drivers (430.34), so this makes me think it’s a kernel regression.

Tomorrow I’ll test the LTS kernel which is on 4.19.61, to see if it is a recent regression.

GoofyX · July 29, 2019, 7:42pm

For me the problem exists only while running 5.2.* kernels. I’m on 5.1.20 at the moment, which runs fine.

dodo.godlike · July 29, 2019, 7:54pm

I just tested LTS kernel and it works without problems.

I checked the first log from the machine, which dates back to 01 July 2019. I was running 5.1.15 with the 430.26 drivers, and I’m sure I had the same problem because the last two lines of the log report

Jul 01 19:12:59 archlinux systemd[1]: session-2.scope: Stopping timed out. Killing.
Jul 01 19:12:59 archlinux systemd[1]: session-2.scope: Killing process 992 (Xorg) with signal SIGKILL.

amrits · August 1, 2019, 10:56am

I prepared a setup with the configuration below but have not been able to repro the issue so far.

Precision T7610 + Genuine Intel(R) CPU @ 2.30GHz + Fedora 29 + Kernel 5.2.4 + KDE Plasma + GeForce GTX 1050Ti + 430.34

Can you please provide the information below so that I can re-attempt the repro?

Please attach an nvidia bug report captured in the repro state if the system is accessible over ssh; otherwise provide one from the working state.
Provide the kernel config file that reproduces the issue.
Provide a short video clip highlighting the issue.

dodo.godlike · August 2, 2019, 10:06am

Unfortunately I’m not able to access the machine for almost the entire month, so I’m not able to provide the nvidia bug report right now.

I’m using the default Archlinux linux package, so the kernel config file is here Groups · Explore · GitLab
For the reason above, I’m not able to produce a video until end of August. But I can link to my previous bug reports containing some logs and screens

System hangs on shutdown · Issue #12967 · systemd/systemd · GitHub
GDM process is being kept alive by Xorg after stopping the service, preventing system shutdown (#503) · Issues · GNOME / gdm · GitLab

GoofyX · August 17, 2019, 1:20pm

Today I compiled Linux 5.2.9 with newer Nvidia drivers (435.17) and the issue still persists.

I’ve already opened a bug with Nvidia and at the moment it’s frustrating, because I cannot add a comment to the bug.

amrits · August 17, 2019, 2:18pm

Hi GoofyX,

If I am not mistaken, you uploaded the bug report and kernel config file to the bug you raised. I tried with the same config but have not been able to replicate the issue yet.
I updated the bug with the same, and you should have received a notification.

I am currently looking for a system with an i5 or i7 processor and will try to repro again. I will keep you updated.

GoofyX · August 17, 2019, 3:09pm

Hi amrits, yes I opened the bug some time ago. I posted a comment here, because for some reason, I couldn’t post a comment there today, the system wouldn’t accept it. I’ve sent you by private message a link to the video showing the issue with the latest kernel and nvidia drivers.

What bothers me is that -in my case- the problem exists with kernel 5.2 only. Unfortunately, 5.1 is EOL and the latest LTS kernel is 4.19, so I’m stuck with either 4.19 or 5.2 (but having to use SysRq key to reboot the system).

Thank you for your time investigating this.