Xid 61 (black screen on startup) Ubuntu 18.04 GTX 1060 mobile

artyom.szasa · July 30, 2018, 2:49pm

HW: ROG STRIX GL503VM-FY022 / NVIDIA GeForce GTX 1060 6GB
Kernel: 4.15.0-29-generic

Issue: black screen (nothing except Ctrl+Alt+PrintScreen+B works) before gdm/lightdm on Ubuntu 18.04.

The drivers worked for a few months, but after a random reboot (no drivers were updated, only grub) I got a blank screen.
Purging the drivers and using nouveau is fine for work, but this way my laptop is useless for gaming…

Fortunately ssh works so I can connect remotely and save logs. In all cases listed below I get the exact same error in dmesg (grep NVRM):

[    1.429482] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  3xx.xx (using threaded interrupts)
[    6.425963] NVRM: GPU at PCI:0000:01:00: GPU-ca4d2121-189c-752b-9cba-302ed81038d4
[    6.425965] NVRM: Xid (PCI:0000:01:00): 61, 1d5e(356c) 00000000 00000000

and (grep nvidia):

[    1.397521] nvidia: loading out-of-tree module taints kernel.
[    1.397888] nvidia: module license 'NVIDIA' taints kernel.
[    1.420726] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    1.427943] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[    1.428625] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    1.436900] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.48  Wed Mar 21 23:48:34 PDT 2018
[    1.439074] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    1.439528] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    3.304333] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 239
[    6.363607] nvidia-modeset: Allocated GPU:0 (GPU-ca4d2121-189c-752b-9cba-302ed81038d4) @ PCI:0000:01:00.0
[   13.435125] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[   16.219211] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0
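
For anyone collecting the same evidence over ssh, the Xid lines can be pulled out of the kernel log with a small filter. This is only a sketch: xid_events is a made-up helper name, and the pattern assumes the "NVRM: Xid (PCI:<address>): <number>," layout seen in the log lines quoted in this post.

```shell
# xid_events is a hypothetical helper, not part of any NVIDIA tool.
# It reads kernel log text on stdin and prints "<pci-address> <xid>"
# for every line that matches the NVRM Xid layout.
xid_events() {
  sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}
```

Typical use would be `dmesg | xid_events` (dmesg may need sudo, depending on the kernel.dmesg_restrict setting).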

I’ve tried:

purging and reinstalling GDM/LightDM
purging and reinstalling nvidia drivers from the Ubuntu stable/updates repository (390.48-0ubuntu3)
purging and reinstalling nvidia drivers from graphics-drivers/ppa (both 396.45-0ubuntu0~gpu18.04.2 and 390.77-0ubuntu0~gpu18.04.1)
purging and reinstalling drivers from nvidia.com (396.45 and 384.130)
adding various kernel parameters in grub (vga=0 rdblacklist=nouveau nouveau.modeset=0 nvidia-drm.modeset=1)
adding options nvidia_xxx_drm modeset=1 to the modprobe configuration

nouveau was blacklisted in all cases, but I always got the same result (black screen and):

[    6.425965] NVRM: Xid (PCI:0000:01:00): 61, 1d5e(356c) 00000000 00000000

and after a few seconds:

[   16.219211] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000987d:0:0

I’ve found that Xid 61 means “Internal micro-controller breakpoint/warning”, but I have found no information on how to further debug the issue.

Has anyone faced the same issue? What can I do to debug it?

Any help would be appreciated, as without a solution I’m stuck using nouveau…

dmesg.log.gz (19.9 KB)
nvidia-bug-report.log.gz (111 KB)

generix · July 31, 2018, 9:03am

If this appeared suddenly, it would point to a hardware issue.
Did you try using the kernel parameters
acpi_osi=! acpi_osi="Windows 2009"
?

artyom.szasa · July 31, 2018, 9:57am

I tried it now, but the result is the same: blank screen and Xid 61 in the logs.
The dmesg output and bug report are attached.

If it is a hardware problem, I have a warranty on the laptop, as I bought it ~3 months ago. But everything, including 3D, works with nouveau (games run, but the performance is so poor that the experience is awful), and I’m not sure that will convince ASUS there is a hardware issue.

Is there any way to get more information on the error behind Xid 61?
nvidia-bug-report.log.gz (132 KB)
dmesg.log.gz (18.2 KB)

generix · July 31, 2018, 2:20pm

Unfortunately not, beyond the fact that the display engine of the GPU fails to set a mode.
Since grub also messes around with graphics modes, did you try downgrading grub? Or playing around with the matching parameters, like
GRUB_GFXPAYLOAD_LINUX=keep (or ‘text’ or ‘none’)
GRUB_GFXMODE=1920x1080
and kernel parameters
video=efifb:off
or video=efifb:width:1920,height:1080

emmanuel.anne · July 31, 2018, 4:09pm

Exactly the same kind of problem here: sometimes when starting X, the video card just hangs. You can still access the machine over the network or power it off cleanly using ACPI (after a delay, because X is stuck), but that’s all.
I have tried disabling MSI using the NVreg_EnableMSI=0 kernel module parameter. It seems to happen less often with this, but it still happens, which is why I took the time to post about it.
It’s brand new hardware and a new installation: kernel 4.14.52, nvidia 396.24, NVIDIA GeForce GTX 1060 3GB GPU.
There are a few out-of-the-ordinary messages in the kernel log, but nothing directly related to this (I guess the same error would appear after waiting 5 minutes in front of the stuck computer, but I haven’t tried that yet):

Jul 31 17:35:22 gentoo kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Jul 31 17:35:22 gentoo kernel: caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs

I’m not sure the sanity check message is related, but since it happens just before, I guess it’s possible.
What else? It’s an AMD Ryzen 7 on an MS-7A34/B350 TOMAHAWK motherboard. I have always used NVIDIA cards until now and never saw anything like this before.
The symptoms are exactly the same as in the first post. When the card doesn’t hang, the performance is excellent everywhere and there is no problem to notice.
The obvious solution would be to stay in X forever without ever powering off the computer, but I’d prefer to avoid that one!

edit: some more info. This is a UEFI boot with the EFI framebuffer, which always works well but is not handled by this driver; the problem sometimes happens when the nvidia driver takes over mode setting for X. I upgraded to nvidia driver 396.45, and it still happens.
I tried various workarounds and everything fails, including:

adding acpi_osi=! acpi_osi="Windows 2009" rcutree.rcu_idle_gp_delay=1 to the kernel command line
trying some settings related to the PCIe bus in the kernel or BIOS (there are not a lot of them anyway), with no effect.
The only thing that seems to have some effect is to shut down the computer instead of just rebooting it when it happens; there seems to be a better chance that it won’t happen again on the next boot this way, but it could just be luck, I can’t be sure.
Anyway, it is very frustrating: there is no message in the logs. I even waited 20 minutes last night after it hung, and I don’t get any message related to any timeout anywhere, nothing. The Xorg log just stops right before the nvidia driver is supposed to give all the info about the video mode, here:
[ 17.893] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[ 17.893] (==) NVIDIA(0): RGB weight 888
[ 17.893] (==) NVIDIA(0): Default visual is TrueColor
…
The problem seems to be related to the nvidia-modeset module, but I can’t say anything more about that!

Edit of the day: I found that setting
options nvidia-drm modeset=1
has the effect that when setting the mode fails during boot, the boot continues and I end up in the EFI console, which is at least better than a stuck computer.
In this case udev is usually stuck too, but that’s probably a side effect (X requests driver information from udev, but since it’s stuck trying to set the video mode, it hangs udev too).
Using systemd-udevd here with openrc, it’s hard to get useful logs from this udev anyway.
X appears to be stuck in D state too, so at this point I can either do stuff in the console or reboot; I can’t just restart X.
Better than nothing, but still not perfect…!

… and finally, it appears that just pre-loading the nvidia module (I loaded nvidia-drm at the same time) fixes everything. This machine has 8 cores + 8 hyper-threads, which makes 16 virtual CPUs, and everything starts in parallel. That seems to create some problems for X here, so maybe it’s not directly related to the nvidia driver after all.
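
The post does not say how the modules were pre-loaded. As an assumption, on a Gentoo/OpenRC system like the one described, this is usually done through /etc/conf.d/modules (on systemd-based distributions the rough equivalent is a file under /etc/modules-load.d/ with one module name per line). A sketch of the OpenRC variant:

```shell
# /etc/conf.d/modules (OpenRC); assumes the "modules" service is in the boot runlevel.
# Lists kernel modules to load early, so they are already present when X starts.
modules="nvidia nvidia-drm"
```

Whether this removes the race on other machines is untested here; it only mirrors what the poster reports worked on this one.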

artyom.szasa · August 5, 2018, 10:14pm

None of the above has helped. I’m still stuck with Xid 61.

Furthermore, I’ve installed Windows To Go on a pendrive, and under Windows Home edition my card works perfectly (at least in 3D benchmarks).

So it cannot be a hardware problem…

I’ve also tried downgrading grub and a fresh install of Ubuntu 18.04, but in Linux I’m constantly getting Xid 61…

Any response from nvidia? It is quite disappointing that there is no response to a submitted bug report!

generix · August 6, 2018, 8:43am

Did you try an MBR (legacy BIOS) install, avoiding UEFI?
I don’t know if your BIOS gives you the option to switch the GPUs to hybrid mode; maybe look for that.
A more esoteric idea would be setting the following in the Screen section of your xorg.conf:

Option "FlatPanelProperties" "Dithering=Disabled"

artyom.szasa · September 7, 2018, 5:58pm

Thank you for the suggestions, but once again none of the above helped :(

My laptop gives me no way to disable UEFI. There is a quite minimal set of options, mostly security-related. So MBR is not an option for me.

I’m completely stuck using nouveau…

It is definitely not a hardware issue: the card works great on Windows (10+ hours of Civ VI without any issue).

It’s not a “broken package” issue: a clean Ubuntu install gives the exact same result…

The error message is not verbose:

NVRM: Xid (PCI:0000:01:00): 61, 1d5e(356c) 00000000 00000000

And there is no way to investigate the issue further. It may not be an NVIDIA issue but, e.g., PCIe power management, but without even a little support on this message I can only guess…

Maybe the NVIDIA team could help, but despite two submitted bug reports I haven’t received even a short reply in a month…

Well, good work! I didn’t notice any “Windows only” label on any NVIDIA cards. On the contrary, NVIDIA claims to support Linux. Well, I wouldn’t call this support…

The only conclusion I’ve reached is never to buy NVIDIA again and to advise as many people as I can against it.

artyom.szasa · September 28, 2018, 7:08am

I’ve tried the new beta driver (410.57) and using that driver gives the same result but different logs:

NVRM: GPU at PCI:0000:01:00: GPU-ca4d2121-189c-752b-9cba-302ed81038d4
NVRM: Xid (PCI:0000:01:00): 62, 17b7(805c) 00000000 00000000

Everything else remains the same (black screen, “ERROR: GPU:0: Idling display engine timed out” in the logs).

howarth.mailing.lists · October 13, 2018, 4:29am

I see the exact same errors on a MacPro 3,1 with a GTX 680 flashed with Mac ROMs under Fedora 28 and the rpmfusion 396.54. At least it is a relief to know that this isn’t just a bug restricted to Apple hardware.

howarth.mailing.lists · October 15, 2018, 10:53pm

Debugging this issue on a MacPro 3,1, I noticed that nvidia_drm.ko depends on the ipmi modules, but on this machine the ipmi device isn’t created and ipmi_si isn’t loadable. Folks on PC hardware seeing the errors described here might want to check whether the ipmi_si module is loaded and behaving properly, as described here: https://www.thomas-krenn.com/en/wiki/Configuring_IPMI_under_Linux_using_ipmitool
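
A quick way to run that check, as a sketch: module_loaded is a hypothetical helper that scans lsmod-style output for an exact module name. It says nothing about whether the module is behaving properly, only whether it is listed.

```shell
# module_loaded is a hypothetical helper name.
# Reads lsmod output on stdin; succeeds only if the first column of some
# row (after the header line) equals the requested module name exactly.
module_loaded() {
  awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}
```

For example: `lsmod | module_loaded ipmi_si || echo "ipmi_si is not loaded"`.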

mikechen6688 · August 10, 2020, 10:29am

Please see my solution to the issue of NVRM: Xid (PCI:0000:01:00): 61…

Environment:

Ubuntu 18.04 LTS
CUDA Driver 450.57
CUDA Toolkit 11.0
cuDNN 8.0.1

Part One. Modify the grub configuration

1. Open grub

$ sudo gedit /etc/default/grub

2. Modify the content

Change

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_GFXMODE=640x480

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nouveau.modeset=0"
GRUB_GFXMODE=1920x1080
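
Edits to /etc/default/grub only take effect after the boot configuration is regenerated (sudo update-grub on Ubuntu) and the machine is rebooted. A minimal sketch for confirming that a parameter actually reached the running kernel follows; has_kernel_param is a hypothetical helper name, and it reads /proc/cmdline when no command line is passed explicitly.

```shell
# has_kernel_param is a hypothetical helper name.
# Usage: has_kernel_param <parameter> [command-line-string]
# Succeeds if <parameter> appears as a whole space-separated word.
# Without a second argument it inspects the running kernel via /proc/cmdline.
has_kernel_param() {
  param=$1
  cmdline=${2:-$(cat /proc/cmdline)}
  case " $cmdline " in
    *" $param "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

After the reboot, `has_kernel_param nouveau.modeset=0 && echo "parameter is active"` would confirm the change was picked up.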

Part Two. Install v5.4 kernel packages

1. Verify the system kernel

$ uname -r
5.4.0-42-generic

2. v5.4 mainline build

1). Download the packages from the weblink

ubuntu kernel: Index of /~kernel-ppa/mainline/v5.4

2). Install the packages:

Build for amd64 succeeded (see BUILD.LOG.amd64):

$ sudo dpkg -i linux-headers-5.4.0-050400_5.4.0-050400.201911242031_all.deb

$ sudo dpkg -i linux-headers-5.4.0-050400-generic_5.4.0-050400.201911242031_amd64.deb

$ sudo dpkg -i linux-image-unsigned-5.4.0-050400-generic_5.4.0-050400.201911242031_amd64.deb

$ sudo dpkg -i linux-modules-5.4.0-050400-generic_5.4.0-050400.201911242031_amd64.deb

Part Three. Check the GPU Status

1. Check whether it has errors

$ dmesg -l err

If a message such as “Failed to reset PPM” appears, do not worry about it; use the following command to clear the error. You may need to reboot the system twice.

$ sudo reboot

2. Check NVRM

$ dmesg | grep NVRM

[ 3.169230] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.57 Sun Jul 5 14:42:25 UTC 2020

t.platzer · August 11, 2020, 9:57am

If you have Xid 61 troubles, you might check out this thread:

There are quite a few people here with the (maybe) same problem and it goes back a few months.

After much experimentation, a user found that the problem happens when the card goes into a low power state that switches down to PCIe Gen 1; when the card then tries to go back up to PCIe Gen 2 or 3, the dreaded Xid 61 is sometimes encountered.

The temporary fix found was to limit the clock frequencies of the card so that it never drops down to PCIe Gen 1:
sudo nvidia-smi -lgc 600,2130

Please note that those values are for my RTX 2070 Super; yours may vary depending on your card. The first is the minimum frequency, the second the maximum allowed.

It has worked great for me and a few others (so far), but some people may still experience issues. There is no official fix from NVIDIA for the time being.

Here is a command to check your current values:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.current,temperature.gpu,clocks.gr,clocks.mem,power.draw --format=csv -l 60
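
To spot drops to PCIe Gen 1 in that output without watching it by eye, the CSV can be filtered. This is a sketch that assumes the column order of the query above (pstate in column 5, pcie.link.gen.current in column 6) and the ", " separator of --format=csv; pcie_gen1_samples is a hypothetical name.

```shell
# pcie_gen1_samples is a hypothetical helper name.
# Reads nvidia-smi CSV output (with header line) on stdin and prints
# "<timestamp> <pstate>" for every sample whose PCIe link generation is 1.
pcie_gen1_samples() {
  awk -F', *' 'NR > 1 && $6 == 1 { print $1 " " $5 }'
}
```

Piping the monitoring command into pcie_gen1_samples would then print only the samples taken while the link was at Gen 1; output may appear with some delay because of pipe buffering.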

Hope that helps, artyom.

[edited for typos]