[SOLVED] CUDA 9.0rc and NVIDIA 384.69 but driver version is insufficient for CUDA runtime version

I’m running Ubuntu 16.04.3, with the Nvidia 384.69 drivers installed through Ubuntu’s “Software & Updates” > “Additional Drivers” UI. I also installed bumblebee, primus, mesa and bumblebee-nvidia.

I’ve also set the relevant environment variables:

# NVIDIA
export PATH="$PATH:/usr/local/cuda-9.0/bin"
export PATH="$PATH:/usr/lib/nvidia-384/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib"
export CUDA_HOME=/usr/local/cuda-9.0
export CUDADIR=/usr/local/cuda-9.0
export GLPATH=/usr/lib
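
A quick way to double-check that the toolkit and the driver’s user-space libraries are actually visible on those paths (just a sanity check):

# toolkit on the PATH?
$ which nvcc && nvcc --version
# driver/runtime libraries known to the dynamic loader?
$ ldconfig -p | grep -E 'libcuda.so|libcudart'
# note: on 64-bit installs the CUDA runtime libraries live in lib64, not lib
$ ls /usr/local/cuda-9.0/lib64 | head
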
$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.69  Wed Aug 16 19:34:54 PDT 2017
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
$ dpkg -l | grep nvidia
ii  bumblebee-nvidia                                            3.2.1-10                                     amd64        NVIDIA Optimus support using the proprietary NVIDIA driver
ii  nvidia-384                                                  384.69-0ubuntu0~gpu16.04.1                   amd64        NVIDIA binary driver - version 384.69
ii  nvidia-opencl-icd-384                                       384.69-0ubuntu0~gpu16.04.1                   amd64        NVIDIA OpenCL ICD
rc  nvidia-prime                                                0.8.2                                        amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                             384.69-0ubuntu0~gpu16.04.1                   amd64        Tool for configuring the NVIDIA graphics driver

I installed CUDA 9.0rc through the runfile method, declining the option to install the bundled driver, which is older (384.59). When I compile and run the CUDA 9.0 deviceQuery sample, I get this error:

optirun ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

Without optirun,

./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

nvidia-bug-report.log.gz (60.5 KB)
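
For reference, the runfile install described above can also be done non-interactively so that the bundled 384.59 driver is never offered. The exact option names can vary between runfile versions, so confirm them with --help first; the filename below is just a placeholder:

$ sh cuda_9.0_linux.run --help                    # confirm which options this runfile supports
$ sudo sh cuda_9.0_linux.run --silent --toolkit   # install the toolkit only, skip the bundled driver
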

Message removed. Not related.

CUDA 9 (and 8, and 7) require a newer driver than the 342.01 driver you have. That old GPU is no longer supported by recent CUDA versions; the last CUDA version that supported it was CUDA 6.5. It is expected behavior that CUDA 9 will not work on that notebook.

That is indicated by this error:

cudaErrorInsufficientDriver

and is discussed in many other postings on this forum.

It has nothing to do with, and is not related to the linux issue reported by OP in this thread.
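
For anyone else landing on this thread with the same message: the quickest shell-level check is to compare the kernel module version with the user-space libcuda that the runtime actually loads. An old or mismatched libcuda is one common way to end up with error 35 even when a newer kernel module is loaded:

$ cat /proc/driver/nvidia/version      # what the kernel module reports
$ ldconfig -p | grep libcuda.so        # which user-space driver library the loader sees
$ nvidia-smi                           # should agree with both if the install is consistent
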

@poppingtonic, judging from the contents of the log file you attached, it seems you have an Acer Predator laptop with a GTX 1060m GPU. Be advised that laptops that originally ship with Windows in an Optimus configuration can be fairly challenging to set up properly in Linux. That said, there are a few probably more basic issues in your setup.

I generally don’t recommend that people use an NVIDIA driver from a source other than NVIDIA. The NVIDIA driver can be packaged in a variety of ways, with or without certain modules, and the exclusion of some of these components may make certain features (like CUDA) unusable or incorrectly configured. Looking at the journal excerpts contained in the bug report log, that appears to be the case here:


journalctl -b -0:
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-opencl-icd-375.list: No such file or directory
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-modprobe.list: No such file or directory
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-375.list: No such file or directory

Sep 07 23:33:57 aleph0 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 07 23:33:57 aleph0 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 07 23:33:57 aleph0 systemd[4855]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:33:57 aleph0 systemd-udevd[4864]: failed to execute ‘/usr/bin/nvidia-smi’ ‘/usr/bin/nvidia-smi’: No such file or directory
Sep 07 23:33:57 aleph0 systemd-udevd[4850]: Process ‘/usr/bin/nvidia-smi’ failed with exit code 2.
Sep 07 23:33:57 aleph0 systemd[4871]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:33:57 aleph0 bumblebeed[1083]: [ 50.171598] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
Sep 07 23:34:01 aleph0 systemd[4911]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:34:01 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:34:01 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:34:01 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:34:01 aleph0 bumblebeed[1083]: [ 54.301667] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
Sep 07 23:35:12 aleph0 systemd[5691]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:35:12 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:35:12 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:35:12 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:35:12 aleph0 bumblebeed[1083]: [ 125.013497] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
Sep 07 23:42:12 aleph0 systemd[6774]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:42:12 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:42:12 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:42:12 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:42:12 aleph0 bumblebeed[6736]: [ 544.841892] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
Sep 07 23:42:18 aleph0 systemd[6828]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:42:18 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:42:18 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:42:18 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:42:18 aleph0 bumblebeed[6736]: [ 551.142928] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
Sep 07 23:43:12 aleph0 systemd[7097]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:43:12 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:43:12 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:43:12 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Sep 07 23:43:13 aleph0 bumblebeed[6736]: [ 605.570955] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory

In short, I would say your driver install is broken. That is also indicated by this basic problem report:

→ CUDA driver version is insufficient for CUDA runtime version

It also appears that you have some components from another driver branch (375).

If you start over with a clean OS load, and actually install the driver from NVIDIA, I think you may have better luck.

Get it working with 384.59 first. The difference between that and 384.69 is not that important if you want to get CUDA running.
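
Before the clean reinstall, something along these lines should show what is actually present and what is missing (package names assumed to be the standard Ubuntu ones; review what apt proposes to remove before confirming):

$ dpkg -l | grep -E 'nvidia|cuda'                          # look for leftovers from the 375 branch
$ ls -l /usr/bin/nvidia-smi /usr/bin/nvidia-persistenced   # the log shows both are missing
# purge the packaged drivers before installing the driver from NVIDIA
# (apt treats the quoted pattern as a regex and lists what it matches before acting)
$ sudo apt-get purge 'nvidia-*'
$ sudo apt-get autoremove
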

Here’s a new log, taken after clearing out the other driver components. I installed the driver using the runfile, then uninstalled it to try 384.69 instead.

nvidia-bug-report.log.gz (60.6 KB)
dump.zip (46 KB)

Some good news: I got this to work.

python examples/mnist_cnn.py
Using TensorFlow backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-09-11 01:31:39.863286: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863308: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863331: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863335: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863365: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:40.193552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-09-11 01:31:40.194152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1060
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.86GiB
2017-09-11 01:31:40.194166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-11 01:31:40.194188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-09-11 01:31:40.194218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0)
60000/60000 [==============================] - 9s - loss: 0.3327 - acc: 0.8996 - val_loss: 0.0784 - val_acc: 0.9756
Epoch 2/12
60000/60000 [==============================] - 6s - loss: 0.1113 - acc: 0.9669 - val_loss: 0.0558 - val_acc: 0.9817
Epoch 3/12
60000/60000 [==============================] - 6s - loss: 0.0835 - acc: 0.9751 - val_loss: 0.0428 - val_acc: 0.9856
Epoch 4/12
60000/60000 [==============================] - 6s - loss: 0.0697 - acc: 0.9794 - val_loss: 0.0366 - val_acc: 0.9881
Epoch 5/12
60000/60000 [==============================] - 6s - loss: 0.0609 - acc: 0.9818 - val_loss: 0.0352 - val_acc: 0.9885
Epoch 6/12
60000/60000 [==============================] - 6s - loss: 0.0555 - acc: 0.9835 - val_loss: 0.0323 - val_acc: 0.9897
Epoch 7/12
60000/60000 [==============================] - 6s - loss: 0.0501 - acc: 0.9853 - val_loss: 0.0307 - val_acc: 0.9901
Epoch 8/12
60000/60000 [==============================] - 6s - loss: 0.0444 - acc: 0.9863 - val_loss: 0.0278 - val_acc: 0.9907
Epoch 9/12
60000/60000 [==============================] - 6s - loss: 0.0430 - acc: 0.9868 - val_loss: 0.0306 - val_acc: 0.9900
Epoch 10/12
60000/60000 [==============================] - 6s - loss: 0.0403 - acc: 0.9879 - val_loss: 0.0292 - val_acc: 0.9901
Epoch 11/12
60000/60000 [==============================] - 6s - loss: 0.0372 - acc: 0.9889 - val_loss: 0.0302 - val_acc: 0.9903
Epoch 12/12
60000/60000 [==============================] - 6s - loss: 0.0367 - acc: 0.9888 - val_loss: 0.0270 - val_acc: 0.9910

I resolved this by doing the following:

I stopped assuming that optirun would power on the card and load the relevant kernel modules. Here’s what I did, using bbswitch to manage power for the card.

$ sudo tee /proc/acpi/bbswitch <<< ON
ON
$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
$ sudo modprobe nvidia_384
$ sudo modprobe nvidia_384_uvm
$ lsmod | grep nvidia
nvidia_uvm            684032  0
nvidia              12976128  1 nvidia_uvm

And to turn it off:

$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia
$ lsmod | grep nvidia
$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
$ sudo tee /proc/acpi/bbswitch <<< OFF
OFF
$ cat /proc/acpi/bbswitch
0000:01:00.0 OFF
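
For convenience, the on/off steps above can be wrapped in one small helper script; this is just a sketch, and the module names assume the packaged 384 driver as in the commands above:

#!/bin/bash
# gpu-power: wrap the manual bbswitch + modprobe steps above
# usage: sudo gpu-power on|off
set -e
case "$1" in
  on)
    tee /proc/acpi/bbswitch <<< ON
    modprobe nvidia_384
    modprobe nvidia_384_uvm
    ;;
  off)
    rmmod nvidia_uvm nvidia 2>/dev/null || true
    tee /proc/acpi/bbswitch <<< OFF
    ;;
  *)
    echo "usage: $0 on|off" >&2
    exit 1
    ;;
esac
cat /proc/acpi/bbswitch
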

I hit a similar issue to the one in the original post.
My solution was simply to run the CUDA samples as root.
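
(Most likely this works because the first run as root can create the /dev/nvidia* device nodes when nvidia-modprobe isn’t set up to do it; after that one root run you can check whether they exist and are readable by your user:)

$ sudo ./deviceQuery
$ ls -l /dev/nvidia*    # nvidia0, nvidiactl (and nvidia-uvm) should now be present
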

Thanks @poppingtonic, this workaround gets it going for me. For some reason, mine needs this to turn the card on, but it turns off again on its own. In my case a script seems to help:

#!/bin/bash
# power the card on via bbswitch and load the driver kernel modules
tee /proc/acpi/bbswitch <<< ON
modprobe nvidia_384
modprobe nvidia_384_uvm
# drop back to the normal user to run the actual command ("$@" preserves all arguments)
sudo -u <username> optirun "$@"

with <username> replaced accordingly.

I saved this to a file called “withcuda”, making sure the directory is on the PATH. Then

chmod +x withcuda

makes it executable. Now I can just run, e.g.,

sudo withcuda deviceQuery

and it works. So far, anyway.

If you have an Intel CPU with an integrated GPU and only need the NVIDIA GPU for programming, NOT for display rendering, do the following:
uninstall all existing NVIDIA/CUDA drivers
install mesa
press Ctrl+Alt+F1 and log in to a command shell
type: sudo service lightdm stop
find the 384 CUDA driver runfile
install the 384 driver runfile and choose NO when prompted about OpenGL and the X server
reboot
unpack the cuDNN libs: tar -xzvf cudnn-9.0-linux-x64-v7.tgz
copy them into the cuda-9.0 directory:
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
make sure you have added the cuda-9.0 paths to your PATH and LD_LIBRARY_PATH
link the CUDA libs with the ldconfig command (a concrete sketch of these last two steps follows below)
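
A concrete sketch of those last two steps, assuming the default runfile install location /usr/local/cuda-9.0:

# add the toolkit to PATH and LD_LIBRARY_PATH (e.g. via ~/.bashrc)
$ echo 'export PATH=/usr/local/cuda-9.0/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
$ source ~/.bashrc

# register the CUDA libraries with the dynamic linker
$ echo /usr/local/cuda-9.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-9-0.conf
$ sudo ldconfig

# verify
$ nvcc --version
$ ldconfig -p | grep -E 'libcudart|libcudnn'
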

hi,
I have a similar problem.
I have a 2-GPU configuration: a Vega 64 for displays and a 780 Ti for CUDA.
I managed to install both and checked with deviceQuery that CUDA was working properly.
After a restart, I tried running deviceQuery again and got error 35.

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.48  Thu Mar 22 00:42:57 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

Any ideas on how to deal with this issue?