black screen with Mac version of GTX 680

howarth.mailing.lists · October 10, 2018, 2:25pm

Does anyone know of any reason why the Mac ROM’ed version of the EVGA GTX 680 should be problematic under the current Nvidia linux drivers compared to the PC ROM’ed version? I have no issues with the nouveau drivers, but the Nvidia 396 and 390 drivers end up producing a black screen instead of the gdm greeter for Ubuntu bionic/cosmic and Fedora 28. This is on a 2008 MacPro with an HD Cinema Display attached by the DVI port. When the black screen appears, it is almost as if the backlighting has gone out.
nvidia-bug-report.log_hdcinema_display_dvi.gz (37.6 KB)
nvidia-bug-report.log_vizio_tv_hdmi.gz (38.3 KB)
nvidia-bug-report.log_hdcinema_display_dvi_no_appledisplay_module.gz (37.9 KB)
nvidia-bug-report.log_hdcinema_display_dvi_no_appledisplay_hid_apple_applesmc_modules.gz (37.2 KB)
nvidia-bug-report-340.107.log.gz (207 KB)
nvidia-bug-report.log-410.57-nvidia-drm.modeset_is_1-gdm_wayland_on.gz (1.01 MB)
nvidia-bug-report.log-410.57-nvidia-drm.modeset_is_0-gdm_wayland_off.gz (1 MB)
xorg.conf.zip (841 Bytes)

generix · October 10, 2018, 5:12pm

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post will reveal a paperclip icon.

howarth.mailing.lists · October 13, 2018, 12:54am

Okay, I have attached the output from nvidia-bug-report.sh run on the same configuration for both the dvi and hdmi connections to an EVGA GTX680 02G-P4-2684-KR flashed to the official Mac ROM images from the Mac version of the card in a MacPro 3,1 (2008).

The linux installation tested was Fedora 28 updated to the current package set with the rpmfusion 396.54-2 installed as well as the mutter 3.28.3-4.fc28 from testing to capture the latest Wayland fixes for gdm.

My read of the nvidia-bug-report logs is that the problem is…

Oct 12 18:26:01 localhost.localdomain kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
Oct 12 18:26:01 localhost.localdomain kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).
Oct 12 18:26:01 localhost.localdomain kernel: NVRM: This can occur when a driver such as: 
                                              NVRM: nouveau, rivafb, nvidiafb or rivatv 
                                              NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Oct 12 18:26:01 localhost.localdomain kernel: NVRM: Try unloading the conflicting kernel module (and/or
                                              NVRM: reconfigure your kernel without the conflicting
                                              NVRM: driver(s)), then try loading the NVIDIA kernel module
                                              NVRM: again.

This is weird because I don’t see any other kernel modules loaded. However, there is an appledisplay module that handles brightness control adjustments from the keyboard on the display.

I also configured this test drive to omit ‘rhgb quiet’ from the grub kernel options and used ‘plymouth-set-default-theme details; dracut -f’ to obtain as verbose output as possible during the boot. The text which occurs prior to installing the nvidia package isn’t shown afterwards during boots. Under the nvidia driver, the white hyphen is displayed in the middle of the screen with no further output until under the DVI connection the display goes dark and under the HDMI connection the monitor reports connection lost.

howarth.mailing.lists · October 13, 2018, 1:53am

Okay, the ‘NVRM: The NVIDIA probe routine was not called for 1 device(s).’ warnings are suppressed if I boot the kernel with the ‘rd.driver.blacklist=appledisplay modprobe.blacklist=appledisplay’ but I still get the black screen. I have attached the nvidia-bug-report for that combination.

There still are two other Apple specific kernel modules that I can disable, applesmc and hid_apple.

howarth.mailing.lists · October 13, 2018, 2:26am

I still see the black screen when all three Apple specific kernel modules are blacklisted (appledisplay, hid_apple, applesmc), and the nvidia-bug-report is attached for that combination. So far appledisplay is the only kernel module whose removal made a distinct difference to how the nvidia driver loads, but blacklisting it doesn’t solve the black screen issue by itself.
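For anyone repeating the blacklist test, here is a minimal sketch of how the boot arguments for several modules could be put together. The function name `blacklist_args` is made up for illustration; the only things taken from the posts above are the `rd.driver.blacklist=` and `modprobe.blacklist=` options and the three module names. Both options accept a comma-separated list.

```shell
#!/bin/sh
# Hypothetical helper: build kernel command-line arguments that keep the
# named modules from loading, both in the initramfs (rd.driver.blacklist,
# read by dracut) and later on (modprobe.blacklist).
blacklist_args() {
    # Join all arguments with commas: "a b c" -> "a,b,c"
    list=$(printf '%s,' "$@")
    list=${list%,}
    printf 'rd.driver.blacklist=%s modprobe.blacklist=%s\n' "$list" "$list"
}

# The three Apple specific modules tested in this thread.
blacklist_args appledisplay hid_apple applesmc
```

On Fedora the resulting string could then be appended to the boot entries with something like `sudo grubby --update-kernel=ALL --args="$(blacklist_args appledisplay hid_apple applesmc)"`; I have not tried that exact invocation on this machine.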

howarth.mailing.lists · October 13, 2018, 4:25am

Looking through the last log for the boot with all three Apple specific kernel modules disabled, I noticed this line…

Oct 12 21:59:03 localhost.localdomain kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.

Oct 12 22:03:38 localhost.localdomain kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Oct 12 22:05:02 localhost.localdomain kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device Apple Cinema HD (DVI-D-0)
Oct 12 22:08:38 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917d:0:0
Oct 12 22:09:14 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917d:0:0
Oct 12 22:10:14 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917d:0:0

which are identical to this unresolved report of black screens on a GTX 1060 among other cards…

[url]https://devtalk.nvidia.com/default/topic/1037997/xid-61-black-screen-on-startup-ubuntu-18-04-gtx-1060-mobile/?offset=9[/url]
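The nvidia-modeset lines quoted above can be pulled out of a saved kernel log with a small filter. This is only a sketch: the function names `modeset_events` and `count_idle_timeouts` are made up for illustration, and the sample input is copied from the log excerpts in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print only nvidia-modeset WARNING/ERROR lines from a
# kernel log read on standard input.
modeset_events() {
    grep -E 'nvidia-modeset: (WARNING|ERROR):'
}

# Hypothetical helper: count "Idling display engine timed out" lines on
# standard input.
count_idle_timeouts() {
    grep -c 'Idling display engine timed out'
}

# Sample input copied from the log excerpts in this thread.
sample='Oct 12 18:26:01 localhost.localdomain kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
Oct 12 21:59:03 localhost.localdomain kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
Oct 12 22:05:02 localhost.localdomain kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device Apple Cinema HD (DVI-D-0)
Oct 12 22:08:38 localhost.localdomain kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917d:0:0'

printf '%s\n' "$sample" | modeset_events
printf '%s\n' "$sample" | count_idle_timeouts
```

On a live system the same filter can be fed from the journal, e.g. `journalctl -k -b | modeset_events`.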

howarth.mailing.lists · October 13, 2018, 4:47am

From the thread above, I tried changing ‘nvidia-drm.modeset=1’ to ‘nvidia-drm.modeset=0’ as well as NVreg_EnableMSI=0 and neither change has any impact on the black screen bug.

howarth.mailing.lists · October 13, 2018, 4:54am

Another thread which seems similar is [url]https://devtalk.nvidia.com/default/topic/968193/linux/-solved-367-378-13-980m-ubuntu-16-10-error-gpu-0-idling-display-engine-timed-out/1[/url]

howarth.mailing.lists · October 13, 2018, 5:10am

One permutation of this bug is that the ‘black screen’ bug is converted to a ‘black screen with white hyphen in the middle and backlighting still on’ for both runlevel 3 and ‘single’ mode.

generix · October 13, 2018, 4:06pm

I think there’s not much more to do besides downgrading to a 340 driver or trying a different distro; sometimes the redhat kernels have strange issues with the proprietary driver.
If nothing helps, the driver just doesn’t get along with that card. You could then check if that’s a vbios issue by flashing the original one.

howarth.mailing.lists · October 13, 2018, 4:43pm

I actually have tried numerous combinations under different Linux distros.

Current Fedora 28 with the rpmfusion and with negativo17 nvidia packaging.
Ubuntu bionic and cosmic with the stock nvidia packaging and ppa:graphics-drivers/ppa packaging for both 390 and 396.

Downgrading to 340 isn’t much of an option since that eliminates the use of a modern kernel without custom patching. Where can I find a definitive listing of all of the kernel options supported for the 390/396 nvidia drivers that might be tweaked in case of VBIOS problems? Downgrading to the PC bios is not really a viable option because using Macs without the option boot selector is rather painful for linux.

howarth.mailing.lists · October 13, 2018, 4:50pm

There is an orthogonal issue apparently with the GTX 680 Mac ROM’ed card versus non-nouveau graphics drivers. I also attempted a manual nvidia driver installation using the recipe described in [url]https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/[/url]. However, when I got to step 2.7, rebooting after using dracut to rebuild the kernel image without the nouveau driver, I saw the same issue. I assume this means that the vesa drivers currently don’t like the Mac ROMs. Are there any threads here about general linux issues with the Mac ROM’ed card, or any place to check whether any additional ROM images beyond the original ones were released by Nvidia?

generix · October 13, 2018, 7:21pm

kernel/driver options to manipulate vbios handling are not available.
Nvidia doesn’t publicly release vbioses; techpowerup has a collection of user-submitted vbioses
there’s no sense in using the .run installer over the repo drivers except for special use cases
the 340 driver has support for current xservers and (vanilla) kernels, you should use it to evaluate if something changed in comparison to current drivers
I don’t think there are more threads about macpros with nvidia cards running linux here, that’s really a niche, sorry.

howarth.mailing.lists · October 13, 2018, 11:25pm

Okay, finally some success. Starting from a clean install of Fedora 28 I was able to manually install the rpmfusion nvidia 340.107 packaging which on reboot properly used the Nvidia drivers. Installing the rpmfusion repo’s with nvidia excluded allowed all of the Fedora 28 packages to be updated without upgrading nvidia beyond the 340.107 release. Now to see if I can get the ubuntu nvidia-graphics-drivers-340 for cosmic to work with the Mac ROM’ed card.

generix · October 14, 2018, 12:06am

[url]Error | NVIDIA

generix · October 14, 2018, 12:34am

If you don’t rely on cuda, nouveau on Kepler afaik supports manual reclocking and use of any firmware.

howarth.mailing.lists · October 14, 2018, 10:40am

I would like to try to get enough information filed that the linux Nvidia developers could open a sensible bug report about this internally as it does appear to be a regression in the drivers post-340. My assumption is that the addition of the new nvidia_drm kernel module might be related to the regression.

Interestingly, on Ubuntu cosmic with the 340.107 drivers installed, running nvidia-bug-report.sh as root produces a bug report with an nvidia-debugdump report on the normally running drivers. However, under Fedora 28 with the rpmfusion installation of 396.54, I didn’t get that output. Would it be worthwhile to try to get an nvidia-debugdump report on the malfunctioning drivers?

howarth.mailing.lists · October 14, 2018, 11:58am

Attached is an nvidia-bug-report generated on Fedora 28 with the successfully installed nvidia 340.107 drivers from rpmfusion on the MacPro 3,1 with EVGA GTX 680 using Mac ROM images. I’ll try to build the beta 410 nvidia packages from rawhide rpmfusion under Fedora 28 to see if that helps. One option I noticed that was installed in the grub boot options in the working 340.107 installation is ‘video=vesa:off’, which I don’t recall being present for the newer nvidia packages.

howarth.mailing.lists · October 14, 2018, 1:53pm

Starting from a usable rpmfusion nvidia 340.107 installation which used the grub boot options…

nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off

After building and installing the current nvidia 410.57 Fedora rawhide packages on Fedora 28 followed by running ‘sudo akmods --force’ to rebuild the kernel modules, the grub boot options became…

modprobe.blacklist=nouveau nvidia_drm.modeset=1

to which I also appended ‘video=vesa:off’ for good measure. The resulting installation was tested for the combinations of nvidia_drm.modeset=1 with /etc/gdm/custom.conf set to WaylandEnable=true and
nvidia_drm.modeset=0 with /etc/gdm/custom.conf set to WaylandEnable=false.

I have attached two new nvidia-bug-reports

nvidia-bug-report.log-410.57-nvidia-drm.modeset_is_0-gdm_wayland_off.gz
nvidia-bug-report.log-410.57-nvidia-drm.modeset_is_1-gdm_wayland_on.gz

These logs contain the binary debug output. In both cases, the end result was the black screen bug; however, the nvidia_drm.modeset=1 and WaylandEnable=true combination produced a brief white flash immediately before the final black screen.
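When switching between these combinations it is easy to lose track of which modeset value a given boot actually used. A minimal sketch for checking it follows; the function name `drm_modeset_value` is made up for illustration. The kernel treats `-` and `_` in parameter names as equivalent, so the sketch accepts both `nvidia_drm.modeset` and `nvidia-drm.modeset`.

```shell
#!/bin/sh
# Hypothetical helper: report the nvidia_drm.modeset value found in a kernel
# command line passed as $1. Prints "unset" if the option is absent. If the
# option appears more than once, this sketch reports the last occurrence.
drm_modeset_value() {
    value=unset
    for arg in $1; do
        case $arg in
            nvidia_drm.modeset=*|nvidia-drm.modeset=*) value=${arg#*=} ;;
        esac
    done
    printf '%s\n' "$value"
}

# Boot options quoted in this post, with video=vesa:off appended.
drm_modeset_value 'modprobe.blacklist=nouveau nvidia_drm.modeset=1 video=vesa:off'
```

For the running system, pass the live command line instead: `drm_modeset_value "$(cat /proc/cmdline)"`.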

howarth.mailing.lists · October 15, 2018, 10:37pm

I think I may have found a clue to the problem. Looking at modules.dep, I see…

extra/nvidia/nvidia-drm.ko: extra/nvidia/nvidia-modeset.ko extra/nvidia/nvidia.ko kernel/drivers/gpu/drm/drm_kms_helper.ko.xz kernel/drivers/gpu/drm/drm.ko.xz kernel/drivers/char/ipmi/ipmi_msghandler.ko.xz

under nvidia 410. Using the description of the ipmi modules from the Oracle Grid Infrastructure Grid Infrastructure Installation and Upgrade Guide, 12c Release 2 (12.2) for Linux, I found only…

$ /sbin/lsmod |grep ipmi
ipmi_devintf 20480 0
ipmi_msghandler 69632 2 ipmi_devintf,nvidia

$ ls -l /dev/ipmi0
ls: cannot access ‘/dev/ipmi0’: No such file or directory

so I looked for the correct number to use for its creation

$ grep ipmi /proc/devices 253 ipmidev
/proc/devices:240 ipmidev

$ sudo mknod /dev/ipmi0 c 240 0x0

$ ls -l /dev/ipmi0
crw-r--r--. 1 root root 240, 0 Oct 15 18:21 /dev/ipmi0

but I still can’t load the other modules such as…

$ sudo /sbin/modprobe ipmi_si
modprobe: ERROR: could not insert ‘ipmi_si’: No such device
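The major-number lookup above can be scripted so the value passed to mknod always matches what the running kernel registered. This is only a sketch: the function name `char_major` is made up for illustration, and the sample file just mimics the layout of /proc/devices with the 240 value seen above.

```shell
#!/bin/sh
# Hypothetical helper: print the major number registered for a character
# device name ($1) in a /proc/devices-style file ($2, default /proc/devices).
char_major() {
    awk -v name="$1" '
        /^Block devices:/ { exit }            # stop before the block section
        $2 == name        { print $1; exit }  # first match in the char section
    ' "${2:-/proc/devices}"
}

# Demonstration on a small sample file laid out like /proc/devices.
sample=$(mktemp)
printf 'Character devices:\n  1 mem\n240 ipmidev\n\nBlock devices:\n  8 sd\n' > "$sample"
char_major ipmidev "$sample"
rm -f "$sample"
```

With that, the node creation would become something like `sudo mknod /dev/ipmi0 c "$(char_major ipmidev)" 0`. Creating the node by hand does not help with the ipmi_si load failure shown above.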