Unable to detect CUDA-capable device after automatic/forced NVIDIA update

liory · December 2, 2015, 7:41pm

Hi,

I had a working system until my NVIDIA driver got updated from 352.39 to 352.63. I tried all the major diagnostic tools; the results are below.

I get the following error in Python:
In [1]: from theano.sandbox import cuda as s
In [2]: s.use('gpu 0')

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu 0 is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)
(same on torch and caffe)

If I run ./deviceQuery, I get:
./deviceQuery
./deviceQuery Starting…Titan X

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
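
The same check that deviceQuery performs can be reproduced from Python as a rough sketch. It assumes a shared CUDA runtime library is loadable under the name libcudart.so (the file name and its location on the loader path are assumptions; the deviceQuery above links the runtime statically). The helper name query_device_count is hypothetical. It returns the raw error code and the device count, so a result such as (38, 0) would correspond to the failure shown above.

```python
import ctypes


def query_device_count(library_name="libcudart.so"):
    """Call cudaGetDeviceCount through ctypes.

    Returns (error_code, device_count), or None when the runtime
    library cannot be loaded at all. The default library name is an
    assumption; installs often ship a versioned file name instead.
    """
    try:
        cudart = ctypes.CDLL(library_name)
    except OSError:
        # Library not found on the loader path.
        return None

    # cudaError_t cudaGetDeviceCount(int *count)
    cudart.cudaGetDeviceCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    cudart.cudaGetDeviceCount.restype = ctypes.c_int

    count = ctypes.c_int(0)
    error_code = cudart.cudaGetDeviceCount(ctypes.byref(count))
    return error_code, count.value


if __name__ == "__main__":
    result = query_device_count()
    if result is None:
        print("CUDA runtime library could not be loaded")
    else:
        print("error code %d, %d device(s)" % result)
```

An error code of 0 means the call succeeded; any other value is the runtime error number that deviceQuery prints.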

If I run nvidia-smi -a:

==============NVSMI LOG==============

Timestamp : Wed Dec 2 11:26:52 2015
Driver Version : 352.63

Attached GPUs : 1
GPU 0000:05:00.0
Product Name : GeForce GTX TITAN X
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0422315005978
GPU UUID : GPU-cfe53b65-5064-4aba-067b-a6dfaafe1fe2
Minor Number : 0
VBIOS Version : 84.00.1F.00.90
MultiGPU Board : No
Board ID : 0x500
Inforom Version
Image Version : G001.0000.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x05
Device : 0x00
Domain : 0x0000
Device Id : 0x17C210DE
Bus Id : 0000:05:00.0
Sub System Id : 0x29923842
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 22 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 12287 MiB
Used : 440 MiB
Free : 11847 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 8 %
Memory : 4 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 34 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 92 C
Power Readings
Power Management : Supported
Power Draw : 22.36 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 275.00 W
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 405 MHz
Applications Clocks
Graphics : 1126 MHz
Memory : 3505 MHz
Default Applications Clocks
Graphics : 1126 MHz
Memory : 3505 MHz
Max Clocks
Graphics : 1518 MHz
SM : 1518 MHz
Memory : 3505 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 1373
Type : G
Name : /usr/bin/X
Used GPU Memory : 285 MiB
Process ID : 2419
Type : G
Name : compiz
Used GPU Memory : 128 MiB

If I do: sudo dmesg | grep NVRM
[ 2523.348540] NVRM: API mismatch: the client has the version 352.39, but
[ 2523.348540] NVRM: this kernel module has the version 352.63. Please
[ 2523.348540] NVRM: make sure that this kernel module and all NVIDIA driver
[ 2523.348540] NVRM: components have the same version.
[ 2523.348551] NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22
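
The two versions in this message are the key diagnostic, so here is a minimal sketch that pulls them out of dmesg text. The function name find_api_mismatch is hypothetical, and the pattern is written only against the wording shown above; other driver releases may phrase the message differently.

```python
import re

# Matches the wording of the NVRM message quoted in this post.
MISMATCH_PATTERN = re.compile(
    r"client has the version (?P<client>\d+(?:\.\d+)+).*?"
    r"kernel module has the version (?P<kernel>\d+(?:\.\d+)+)"
)

SAMPLE_DMESG = """\
[ 2523.348540] NVRM: API mismatch: the client has the version 352.39, but
[ 2523.348540] NVRM: this kernel module has the version 352.63. Please
[ 2523.348540] NVRM: make sure that this kernel module and all NVIDIA driver
[ 2523.348540] NVRM: components have the same version.
"""


def find_api_mismatch(dmesg_text):
    """Return (client_version, kernel_module_version), or None if absent."""
    # The message wraps across several log lines, so drop the
    # timestamp and "NVRM:" prefix and join the pieces first.
    message = " ".join(
        line.split("NVRM:", 1)[1].strip()
        for line in dmesg_text.splitlines()
        if "NVRM:" in line
    )
    match = MISMATCH_PATTERN.search(message)
    if match is None:
        return None
    return match.group("client"), match.group("kernel")


if __name__ == "__main__":
    print(find_api_mismatch(SAMPLE_DMESG))
```

Here the client (the user-space libraries) is still at 352.39 while the loaded kernel module is 352.63, which is why every CUDA call fails even though nvidia-smi can still list the card.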

If I do: nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

I installed everything by following the documentation exactly:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-upgrades

My setup:
CUDA-7.5
GeForce GTX TITAN X
Ubuntu 14.04.3 LTS
The current NVIDIA driver version is 352.63; the old one was 352.39.

Thank you for any help. Many others around me with the same setup have this problem, so any help is very much appreciated!

Robert_Crovella · December 2, 2015, 7:53pm

352.39 is the driver version that gets installed when you install CUDA 7.5 using the runfile installer method. If you then attempt to update things using the package manager method, you will break the install.

This is covered in the document you linked:

Installation Guide Linux :: CUDA Toolkit Documentation

If you originally installed CUDA 7.5 using the runfile installer, you must use the runfile installer method to make any updates.

My suggestion would be either to:

Start over with a clean load of the OS, and use one or the other method to install CUDA and drivers.
Attempt to uninstall the components that were installed by the package manager method, and use a runfile install method to make any updates you wish. The drivers can be updated with a runfile installer by downloading a driver installer from nvidia.com.

The uninstall instructions are covered in the document section I linked above.

liory · December 2, 2015, 8:12pm

Thank you for your fast reply!

Sorry, I should have mentioned: I used the Debian package from the NVIDIA page to install CUDA, not the runfile. I still had the 352.39 version, though.

Here is what I downloaded:

cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb

and installed with (from the documentation):

$ sudo dpkg -i cuda-repo-.deb

Update the Apt repository cache

$ sudo apt-get update

Install CUDA

$ sudo apt-get install cuda

Robert_Crovella · December 2, 2015, 8:24pm

Then the 352.63 package that was used to update your machine did not contain the necessary components. It’s not clear what transpired, exactly. It’s not clear whether this driver update was one that you performed manually or occurred through some other process. If you performed it manually, you haven’t indicated anywhere that I can see what commands you used.

Perhaps you should try:

sudo apt-get install cuda-drivers

liory · December 2, 2015, 11:38pm

Problem solved! It was caused by an update of the NVIDIA driver. Anyhow, thank you for your help!

The problem was that there were two Debian packages with the same name and different version numbers: 352.63 and 352.39. The .39 (the correct version) came from the local .deb file downloaded from the CUDA downloads page, and the .63 (the wrong one) came from the standard Ubuntu sources.
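
One way to spot this kind of clash is to look at the output of apt-cache madison, which lists one "package | version | source" line per available candidate. The following is only a sketch: the function names are hypothetical, and the sample text is illustrative rather than captured from my machine (the 352.63 version suffix and both source strings are assumptions).

```python
from collections import defaultdict

# Illustrative `apt-cache madison nvidia-352` output; the 352.63
# suffix and the source columns are assumptions, not real captures.
SAMPLE_MADISON = """\
 nvidia-352 | 352.63-0ubuntu0.14.04.1 | http://archive.ubuntu.com/ubuntu/ trusty-updates/restricted amd64 Packages
 nvidia-352 | 352.39-0ubuntu1 | file:/var/cuda-repo-7-5-local/  Packages
"""


def parse_madison(output):
    """Split madison output into (package, version, source) tuples."""
    rows = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Keep only well-formed three-column lines.
        if len(fields) == 3 and all(fields):
            rows.append(tuple(fields))
    return rows


def conflicting_versions(output):
    """Map each package offered in more than one version to those versions."""
    versions = defaultdict(set)
    for package, version, _source in parse_madison(output):
        versions[package].add(version)
    return {
        package: sorted(found)
        for package, found in versions.items()
        if len(found) > 1
    }


if __name__ == "__main__":
    print(conflicting_versions(SAMPLE_MADISON))
```

If a driver package shows up with versions from both the local CUDA repository and the Ubuntu archive, an ordinary upgrade can replace part of the driver stack with the newer archive version.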

This can be solved with:
sudo apt-get remove --purge nvidia*
sudo apt-get install nvidia-352=352.39-0ubuntu1 nvidia-352-dev=352.39-0ubuntu1

Followed by
sudo apt-get install cuda