No CUDA device found

Hi,
I'm having an issue with CUDA 7.5 on Ubuntu 15.10 after I did an apt-get upgrade. I have since uninstalled and reinstalled the complete CUDA package (including all dependencies) using apt-get remove and apt-get autoremove.

After reinstallation, the output of the deviceQuery example is "cudaGetDeviceCount returned 38 → no CUDA-capable device is detected". This is despite the fact that lspci detects an NVIDIA GPU and the driver reports itself as installed. The one odd detail I have found so far is the output of cat /proc/driver/nvidia/gpus/0000:01:00.0/information, which contains a lot of question marks. I suspect this means the driver was not installed properly after all.

My question is: how do I continue troubleshooting in this situation?

lspci output:
$ lspci | grep NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)

device files:
$ ls /dev/nvidia* -l
crw-rw-rw- 1 root root 195, 0 Jun 30 15:39 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 30 15:38 /dev/nvidiactl

driver version
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.99 Mon Jul 4 23:52:14 PDT 2016
GCC version: gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2)

driver gpu info
$ cat /proc/driver/nvidia/gpus/0000:01:00.0/information
Model: GeForce GTX 980 Ti
IRQ: 131
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:01:00.0

Perhaps you just need a reboot.

If the situation is not resolved after a reboot, try running deviceQuery with root privileges.

If that doesn't work, take a look at the output of:

dmesg | grep NVRM
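
A couple of other quick checks can help narrow it down (nvidia-smi ships with the driver itself, so if it also fails to see the GPU, the problem is at the driver level rather than in the CUDA toolkit):

$ lsmod | grep nvidia     # is the kernel module actually loaded?
$ nvidia-smi              # the driver's own management tool; does it see the GPU?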

Thanks for the quick reply!

The situation was not resolved after a reboot :( No change when running deviceQuery with root privileges either. The output of dmesg after a reboot and one invocation of deviceQuery is as follows:

$ dmesg | egrep 'nvidia|NVRM'
[ 2.744582] nvidia: module license 'NVIDIA' taints kernel.
[ 2.746294] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.754294] [drm] Initialized nvidia-drm 0.0.0 20150116 for 0000:01:00.0 on minor 0
[ 2.754297] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 352.99 Mon Jul 4 23:52:14 PDT 2016
[ 21.149649] NVRM: RmInitAdapter failed! (0x2d:0x63:1406)
[ 21.149660] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 21.149676] NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5
… repeated some 5 times …
[ 42.315600] NVRM: RmInitAdapter failed! (0x2d:0x63:1406)
[ 42.315612] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 42.315629] NVRM: nvidia_frontend_open: minor 0, module->open() failed, error -5

Clearly the driver thinks something is wrong, but it’s not clear what. You might want to see if a power-cycle fixes it (doubtful). Apart from that, my best guess is a corrupted driver install.

You don't indicate the exact clean-up process, nor the exact method by which you did the reinstall. Depending on which archive you got the 352.99 driver from, it's possible that there are missing or mismatched pieces.

Package manager installs are great when they work. When they don't, it can often be difficult to scour the system for mismatched driver files. You could consider using a method like this:

sudo apt list --installed | grep nv

and parse through the output looking for driver packages that don't match your 352.99 driver.
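
For example, something along these lines should surface any installed NVIDIA package whose version string doesn't contain 352.99 (just a sketch; anything it prints is worth a closer look, not necessarily broken):

$ dpkg -l | grep -i nvidia | grep -v 352.99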

I'm not an expert at untangling this; at this point I would usually give up and try a clean reinstall again, carefully following the instructions in the Linux install guide, being sure to use only .deb archives or .run files from an NVIDIA source (i.e. not a PPA or anything like that), and hope for a better result.
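
If it helps, the clean-up side usually looks something like this (the package name patterns are an assumption; check what is actually installed first, and follow the install guide for the reinstall itself):

$ sudo apt-get purge "nvidia*" "cuda*"
$ sudo apt-get autoremove
# then reinstall only from an NVIDIA-provided .deb archive or .run file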

Barring that, a clean OS load and reinstall. I’m not suggesting this is a trivial matter, just that it usually works for me.

You’ve now learned that apt-get upgrade can be a recipe for trouble.

When I am having extreme difficulty, I will usually try to install the driver by itself. In that case, I can inspect the driver installer log for clues about the actual problem, and usually some Google searching helps me understand how to work around it. One such example is here:

[url]https://devtalk.nvidia.com/default/topic/978103/cuda-support-for-legacy-gpus/[/url]

It’s not relevant to your case, just pointing out that some driver install scenarios require a high level of effort to figure out. For example, if your apt-get upgrade pulled in a Linux kernel version for which the driver install process could not properly compile against the kernel headers, this would show up in the installer log. It may also be evident during the package install process, but you have to look at the package install output pretty closely to see it.
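
For reference, when I do a standalone driver install, the installer writes a log I can grep through (the .run file name below is just an example for 352.99; the log path is the installer's default):

$ sudo sh NVIDIA-Linux-x86_64-352.99.run
$ grep -iE 'error|warning' /var/log/nvidia-installer.log
# a kernel-header mismatch usually shows up there; compare against:
$ uname -r
$ dpkg -l | grep linux-headers-$(uname -r)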

This is what I was fearing. Thanks for the help.

Scanning through the list, I see only packages with the correct version number. So reinstalling it is. Unfortunately, it seems that CUDA 7.5 is no longer provided on the official site. I'll try installing CUDA 8 (which happens to be available only for 16.04 and 14.04, not 15.10). We'll see what happens.

Yeah, it really can. But how then are we supposed to keep our system up to date and secure?

Again many thanks!

CUDA 7.5 is available on the legacy CUDA toolkits page (do a Google search for that).

The methodology to allow system-wide upgrades (which may upgrade the kernel) currently involves DKMS (Google that), AFAIK.

DKMS is designed to allow the driver to get recompiled/rebuilt “automatically” either at each reboot or when the system determines it is needed. I’m not an expert on DKMS. Furthermore:

  1. DKMS often has to be manually installed and configured, and the methodology may vary by OS.
  2. I'm not sure it works in 100% of cases (for example, I don't think DKMS can/would fix the previous example I linked, where the kernel upgrade introduced an incompatibility in the driver interface compile process).

The NVIDIA graphics driver on Linux has considerable complexity, and not every scenario is covered automatically.
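
As a starting point, dkms status will show whether the driver module is registered and built for the kernel you booted (the module name and versions below are just what I'd expect on this particular setup):

$ dkms status
# hoping for a line roughly like:
#   nvidia-352, 352.99, 4.2.0-xx-generic, x86_64: installed
# if nothing is listed for the running kernel, a rebuild can be forced with:
$ sudo dkms autoinstall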

I reinstalled using the CUDA 8 suite for Ubuntu 16.04 and now the card can be found (the samples also compile and run correctly!).

In the end it turned out that I didn't need exactly version 7.5, so this issue can be considered resolved.

Thanks for the help and insights txbob :)