Some funkiness with my first try at running a linux nvidia driver.

Recently converted a Windows 8.1 laptop to a dual-boot with Ubuntu 14.04 in order to work on some linux open source neural net software.

Need CUDA and thus an nvidia driver. Have it working except for two inconveniences.

Screen occasionally freezes. I’ve found that I can get the screen working again by 1) ctrl-alt-f1 to drop into a tty then 2) ctrl-alt-f7 to switch back to X - at which point the driver is responsive again.
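Presumably the same switch could be done from the tty with chvt as well (just an assumption on my part; I’ve only used the key combinations):

sudo chvt 1    # equivalent of ctrl-alt-f1: switch to the text console
sudo chvt 7    # equivalent of ctrl-alt-f7: switch back to the VT running X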

The other problem is when I need to reboot. The nvidia driver built a new version of the kernel, 3.16.0-45-generic, where the old one was -30. If I try to boot directly into 45, the screen freezes before the logon screen appears. Instead I boot first into 30, then restart into 45, at which point the logon screen comes up and I can log in.

It’s nice to have found the workarounds, and I don’t need to use them frequently, but it would be better to understand what’s going on and fix it.

“The nvidia driver built a new version of the kernel” is a little surprising. I’m not sure how this will be interpreted on this forum, but my best advice is to not use the .run file driver installer if at all possible. This is something that tends to trip up people who are coming to Linux from Windows, who are used to having separate installers for individual drivers and programs, and it’s really not a sensible way of doing things when you have a Linux package manager you can use instead. The only companies that even package Linux software with executable .run installers are generally those that maintain Windows code and are just used to doing it that way, and it almost always introduces more complications than necessary (like your new kernel).

Generally, Nvidia drivers, being proprietary, will not be in the default set of sources configured by a given distro’s package manager, so you’ll need to add some additional repos. In Ubuntu, these come in the form of additional PPAs, which you can add with add-apt-repository at the command line or in the Synaptic package manager. For Nvidia drivers in Ubuntu, there’s this new semi-official PPA, which was created specifically in response to too much confusion as to how they’re supposed to work:

As well as this other one I’ve been using for a while, which looks like it’s just been deprecated in favour of the new one:

https://launchpad.net/~mamarley/+archive/ubuntu/nvidia

I can see the argument for “I’d rather get something directly from the vendor than through some guy who happens to maintain a build pipeline for Ubuntu,” but trust me, Linux is all about the package manager. For example, I have no idea how to advise you on removing the driver version you installed via the .run file; if it had been installed via the package manager, it’d be a simple apt-get remove xyz.
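For a rough idea of what the package-manager route looks like at the command line (the PPA name below is just derived from the Launchpad URL above, and "nvidia-355" is only a stand-in for whichever driver package you end up choosing):

sudo add-apt-repository ppa:mamarley/nvidia    # add the PPA, then refresh the package lists
sudo apt-get update
apt-cache search ^nvidia-                      # see which driver packages are actually available
sudo apt-get install nvidia-355                # install one (version number is illustrative)
sudo apt-get remove --purge nvidia-355         # and removal is a one-liner, which is the whole point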

Hope that helps!

Thanks - that is useful.

No, I didn’t install from the .run package. Or rather, I did, suffered the apparently predictable consequences, and shortly thereafter re-installed Ubuntu. Live and learn.

I attempted many things before I got what’s working now. What finally did the trick was:

sudo apt-get install cuda-6.5

I tried cuda 7.0 and it didn’t work either (frozen/hung machine), so I backed up a version, and 6.5 did work. It more or less works, with the inconveniences I mentioned.

I’m currently running 346.82. I see at the url you’ve provided that there are driver versions up to 355, and at some point I’ll try out more recent versions.
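For reference, a couple of quick ways to check what’s actually running (the second assumes nvidia-smi is on the path, which the cuda packages appear to arrange):

cat /proc/driver/nvidia/version    # the loaded kernel module reports its driver version
nvidia-smi                         # prints the driver version in its header as well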

“rebuilt the kernel.”

To be more exact, apt-get install fired off builds that created new initrd* and vmlinuz* files in /boot. I have the original version 30 group and, as I was mucking around trying to get cuda installed, a 45 group got built and rebuilt several times. Very long ago (decades) I had to rebuild the kernel to install drivers, and I know that drivers now, for the most part, load dynamically. Given the very approximate state of my linux systems knowledge these days, I called new files in /boot “rebuilding the kernel,” which is possibly incorrect.

Yup, it’s just building modules against your current installed kernel(s) – which is something that any functioning install-from-package-manager script would do too. Even that’s not so typical these days given that the kernel itself contains pretty much all the GPL drivers you’d ever need; the only frequent exceptions I can think of are Virtualbox (which must be non-GPL or something?) and the nvidia binary driver.
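If you want to see that in action, and assuming the packages register the module with DKMS (which is how Ubuntu normally handles out-of-tree modules), something like this shows what got built for which kernel:

dkms status                                                # out-of-tree modules and the kernels they were built against
modinfo nvidia | grep -E '^(filename|version|vermagic)'    # the nvidia.ko that modprobe would load, and its target kernel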

Next iteration.

I made considerable progress by upgrading everything, in particular reinstalling the OS with Ubuntu 15.04 instead of 14.04. Where previously I had to fall back to cuda 6.5, I can now install cuda 7.0, which brings in nvidia driver 346.59.

All previous driver-related problems disappeared, with the exception of one. I have two kernel versions (or what I maybe mistakenly identify as kernel versions) in /boot: 3.19.0.15 and 3.19.0.26. It’s 26 that has cuda and the nvidia driver installed. The problem is I can’t boot directly into 26: if I try, it hangs at the screen that shows just “Ubuntu” with 5 white/red dots below it (the dots are a progress bar, of course).

What I have to do is use the advanced boot menu to boot into 15; once the login screen arrives, I do a quick restart into 26, and it then successfully proceeds to the login screen.

So there’s a video driver hang prior to getting to the login screen. It seems to me that what I’m doing is resetting state somewhere when I temporarily back up to 15, which then allows a boot into 26. But where is that state?

I’d like to capture the state from the first, frozen boot into 26. I found /var/log/boot.log, but that seems to cover only the current boot.

Any other pointers to tools that would give me low-level info about a previous boot would be appreciated.
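One thing I may try, assuming 15.04’s systemd journal works as documented, is making the journal persistent so that logs from a failed boot survive into the next one:

sudo mkdir -p /var/log/journal             # this directory tells journald to keep logs across reboots
sudo systemctl restart systemd-journald
journalctl --list-boots                    # after the next bad boot: list prior boots
journalctl -k -b -1                        # and show the kernel messages from the previous boot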

menomnon, can you press the escape key at the Ubuntu screen with the dots and if so, do you see any messages that indicate what’s going wrong? If you can get a login prompt that way (you may also need to press Ctrl-Alt-F1), please run “sudo nvidia-bug-report.sh” and attach the resulting nvidia-bug-report.log.gz file here.

Running “sudo nvidia-uninstall” should take care of that.

It hangs at the Ubuntu screen just after all 5 dots have gone red, at which point it responds to neither ESC nor ctrl-alt-F1.

Hmm.

Can you get into the GRUB boot menu? I know Ubuntu makes that annoyingly difficult but I’m sure there’s a way. If you can interrupt the boot process there and edit the boot entry, you can try removing the “quiet” and “splash” options before booting the kernel. That might provide more information about what’s going wrong.
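If you want that to stick across reboots rather than editing the entry by hand each time, the usual approach is roughly this (the stock Ubuntu default is GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"):

# in /etc/default/grub, drop "quiet splash" so kernel and init messages stay visible
GRUB_CMDLINE_LINUX_DEFAULT=""
# then regenerate the grub configuration
sudo update-grub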

Yes, I can get into grub (prior to booting). If I understand you correctly, you’re saying I should modify some grub parameters prior to a restart?

If you give me instructions, I’d be glad to carry them out since I’m sure I’ll learn something in the process.

I succeeded in turning off the splash screen with the instructions here:

http://askubuntu.com/questions/33416/how-do-i-disable-the-boot-splash-screen-and-only-show-kernel-and-boot-text-inst

except that the messages go by really fast, and when the screen hangs no messages are displayed.

So I need to figure out the next step.

Successive reboots seem to overwrite /var/log/boot.log, so I realized I could do the hanging boot, then reboot into Windows (this is a dual-boot system), and, since I have an extfs driver there, save off the bad files.

As it happens, /var/log/boot.log showed no differences between bad and good boots. So I thought: hmm, downstream of that? Which would probably mean X.

Went through various X files to see what I might find. Xorg.0.log may pinpoint the problem. In particular in the bad case you don’t get to the second of these two lines:

[     3.595] (II) Module "ramdac" already built-in
[     3.596] (II) intel(G0): Using Kernel Mode Setting driver: i915, version 1.6.0 20141121

It would appear that in the bad case things hang just after ramdac.
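For what it’s worth, the comparison amounted to something like this (the file names are just what I called the saved copies; the sed strips the timestamps so the lines match up):

diff <(sed 's/^\[[^]]*\] //' Xorg.0.log.good) <(sed 's/^\[[^]]*\] //' Xorg.0.log.bad)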

I’ll attach a zip file with two Xorg.0.log: one good, one bad.

XorgLog.zip (8.06 KB)

Sounds like you have onboard Intel graphics that your machine might be trying to use instead of, or on top of, the nVidia graphics? You might want to try disabling those in the BIOS/UEFI.

Yes, it’s an Asus N550JV-DB71 and has 2 graphical systems: Intel and Nvidia. I’ll take a look to see if I can somehow turn off the Intel onboard graphics.

As best I can tell, both from researching it on the web and from going through the BIOS closely, I can’t simply turn off the Intel graphics. It was a nice idea. My guess is the laptop would have been more expensive had the two graphical systems been fully switchable in this sense.

So where does that leave me? I can continue to, shall we say, double-boot, that is, boot first into 3.19.0.15 and only then into 26. Tedious, but I’ve pushed on it quite a bit and it seems reliable. But again, it’s telling me something about a state change, and if I can just run that down, I’ll be closer to a solution. The code for these drivers is open source (I believe), so conceivably I could build the driver myself and figure out how to debug it.

Oh, it’s a notebook, eh? Right, I doubt you’d be able to outright disable one of the GPUs in that case. There’s a name for switchable Intel/nVidia setups – Bumblebee or Optimus or something – that you can read up on, and I’m sure someone maintains a Ubuntu PPA that will set everything up for you automatically. Good luck with that; it seems more complicated than necessary to me (my past several notebooks have all been exclusively Intel graphics because I don’t mind also having a desktop to put a discrete GPU into), but it’s not that uncommon.

The code is not open source, and it was probably naive to think it was. I believe this is one of the more important locations for nvidia linux drivers?

And (duh) the page is marked as proprietary drivers.

That said, there is driver source code on the machine under /usr/src/nvidia-346-346.59, and it looks like this is the code that may have been built and deployed at the most recent install. I doubt, though, that this is the full driver; the more proprietary bits are presumably locked away in binary libraries.

The driver remains an interesting research project, but I think the things I should really be concerned with lie elsewhere. So for the time being, not only is the machine a dual-boot, but the linux part of it is a double-boot. I may or may not have a revelation at some point as to what the state change is that occurs between 15 and 26.

Even if it isn’t open source, I’d expect there to be a PPA that provides binary packages and/or automatic configuration scripts. Google around for a Bumblebee PPA or something like that.

Bumblebee is at the same launchpad location.

I have the PPA configured on my system, and so I can see it in apt-cache search.
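For example:

apt-cache search bumblebee           # the bumblebee packages show up once the PPA is in the sources list
apt-cache policy bumblebee-nvidia    # and policy shows which repository a given package would come from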

I wondered: what is Bumblebee? Nvidia Optimus support, it seems, and in particular Optimus is for laptops that have dual graphics systems - such as mine. Ah, that’s interesting - it’s precisely my dual graphics system that the driver is currently stumbling over.

But is the Bumblebee driver compatible with cuda 7.0? I’m googling now to that effect but this question is less obvious.

I installed the latest bumblebee with sudo apt-get install bumblebee-nvidia. Rebooted. Had to boot twice - first 15 then 26 - so no improvement there.

I’m still unsure what bumblebee is. I dig around a bit. Ah, it’s a service:

MyId@MyId-N550JV:/usr/local/cuda-7.0/samples/1_Utilities/bandwidthTest$ ps aux | grep -i bumblebee
root       735  0.0  0.0  36492  3324 ?        Ss   11:49   0:00 /usr/sbin/bumblebeed
patfla    4110  0.0  0.0  13696  2244 pts/1    S+   12:33   0:00 grep --color=auto -i bumblebee
MyId@MyId-N550JV:/usr/local/cuda-7.0/samples/1_Utilities/bandwidthTest$

So it’s not a driver (probably) and I shouldn’t expect it to solve my driver problem.

It seems you can test it with the following command (except that it fails in the way shown):

MyId@MyId-N550JV:/usr/local/cuda-7.0/samples/1_Utilities/bandwidthTest$ optirun glxgears -info
[ 1825.473926] [ERROR]Cannot access secondary GPU, secondary X is not active.

[ 1825.473974] [ERROR]Aborting because fallback start is disabled.

So it is there (as a daemon), but it doesn’t work … and I’m not sure it’s what I need anyway.

I’ve continued to investigate in the background.

When it hangs upon boot, it is crashing in the kernel. I find the following in /var/log/kern.log:

Sep  9 14:37:58 MyId-N550JV kernel: [    2.173418] cfg80211:   (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A, 2000 mBm), (N/A)
Sep  9 14:37:58 MyId-N550JV kernel: [    2.173420] cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 0 mBm), (N/A)
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179717] BUG: unable to handle kernel NULL pointer dereference at           (null)
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179725] IP: [<ffffffff817c9248>] __down+0x48/0xe0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179731] PGD 224294067 PUD 222649067 PMD 0 
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179735] Oops: 0002 [#1] SMP 
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179737] Modules linked in: intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp arc4 kvm_intel asus_nb_wmi asus_wmi kvm sparse_keymap mxm_wmi ath9k snd_hda_intel(+) ath9k_common crct10dif_pclmul snd_hda_controller crc32_pclmul nvidia(POE+) snd_hda_codec ath9k_hw ghash_clmulni_intel snd_hwdep ath snd_pcm aesni_intel mac80211 uvcvideo i915(+) aes_x86_64 videobuf2_vmalloc snd_seq_midi lrw videobuf2_memops snd_seq_midi_event gf128mul videobuf2_core snd_rawmidi glue_helper ath3k ablk_helper v4l2_common btusb cryptd videodev cfg80211 snd_seq media bluetooth snd_seq_device snd_timer drm_kms_helper drm joydev snd ie31200_edac mei_me i2c_algo_bit edac_core soundcore mei lpc_ich serio_raw shpchp wmi mac_hid video parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq psmouse r8169 ahci libahci mii
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179792] CPU: 1 PID: 474 Comm: nvidia-persiste Tainted: P           OE  3.19.0-28-generic #30-Ubuntu
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179795] Hardware name: ASUSTeK COMPUTER INC. N550JV/N550JV, BIOS N550JV.208 11/19/2013
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179797] task: ffff880222b45850 ti: ffff8802233fc000 task.ti: ffff8802233fc000
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179800] RIP: 0010:[<ffffffff817c9248>]  [<ffffffff817c9248>] __down+0x48/0xe0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179804] RSP: 0018:ffff8802233ffb18  EFLAGS: 00010086
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179805] RAX: 0000000000000000 RBX: ffffffffc1349570 RCX: ffffffffc1349578
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179807] RDX: ffff8802233ffb18 RSI: ffffffffc119dba3 RDI: ffffffffc1349570
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179809] RBP: ffff8802233ffb58 R08: 000000000001df10 R09: ffff8800c82fb000
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179811] R10: 0000000000000292 R11: 0000000000017c30 R12: 7fffffffffffffff
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179813] R13: ffff880222b45850 R14: 0000000000000002 R15: 00000000000000ff
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179816] FS:  00007fe95debe700(0000) GS:ffff88022ee40000(0000) knlGS:0000000000000000
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179820] CR2: 0000000000000000 CR3: 0000000222d90000 CR4: 00000000001407e0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179822] Stack:
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179824]  ffffffffc1349578 0000000000000000 ffffffffc10aed0b ffff88022436a000
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179827]  ffffffffc1349570 ffff8800c8938000 ffff8800c8ed7b48 0000000000000003
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179831]  ffff8802233ffb88 ffffffff810be014 000000000001df10 0000000000000292
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179834] Call Trace:
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179910]  [<ffffffffc10aed0b>] ? nvidia_open+0x8b/0x850 [nvidia]
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179915]  [<ffffffff810be014>] down+0x44/0x50
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179954]  [<ffffffffc10aefef>] nvidia_open+0x36f/0x850 [nvidia]
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179995]  [<ffffffffc10bc61d>] nvidia_frontend_open+0x4d/0xa0 [nvidia]
Sep  9 14:37:58 MyId-N550JV kernel: [    2.179999]  [<ffffffff811f9a0f>] chrdev_open+0x9f/0x1d0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180002]  [<ffffffff811f9970>] ? cdev_put+0x30/0x30
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180005]  [<ffffffff811f2432>] do_dentry_open+0x1d2/0x330
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180008]  [<ffffffff811f3cd8>] vfs_open+0x58/0x60
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180011]  [<ffffffff81202877>] do_last+0x247/0x12c0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180015]  [<ffffffff812058e0>] path_openat+0x80/0x5f0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180018]  [<ffffffff81077dea>] ? release_task+0x38a/0x480
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180021]  [<ffffffff8120706a>] do_filp_open+0x3a/0xb0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180024]  [<ffffffff81213e47>] ? __alloc_fd+0xa7/0x130
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180027]  [<ffffffff811f405a>] do_sys_open+0x12a/0x280
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180031]  [<ffffffff81077870>] ? task_stopped_code+0x60/0x60
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180038]  [<ffffffff811f41ce>] SyS_open+0x1e/0x20
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180045]  [<ffffffff817cb6cd>] system_call_fastpath+0x16/0x1b
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180046] Code: 80 b9 00 00 48 83 ec 20 48 8b 47 10 48 89 4d c0 48 89 57 10 48 89 fb 49 bc ff ff ff ff ff ff ff 7f 41 be 02 00 00 00 48 89 45 c8 <48> 89 10 4c 89 6d d0 c6 45 d8 00 0f 1f 44 00 00 49 c7 45 00 02 
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180070] RIP  [<ffffffff817c9248>] __down+0x48/0xe0
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180074]  RSP <ffff8802233ffb18>
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180076] CR2: 0000000000000000
Sep  9 14:37:58 MyId-N550JV kernel: [    2.180079] ---[ end trace 8cde2489b6728ac0 ]---
Sep  9 14:37:58 MyId-N550JV kernel: [    2.191287] Adding 8268796k swap on /dev/sda8.  Priority:-1 extents:1 across:8268796k SSFS

and, in particular, it’s crashing in the nvidia module’s nvidia_open routine.
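To pull just that section out of the log (the oops begins at the NULL pointer dereference line, and 50 lines of context covers the whole call trace):

grep -A 50 'BUG: unable to handle kernel NULL pointer dereference' /var/log/kern.log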