Which step caused me to fail to find my NVIDIA card while installing CUDA and cuDNN?

I tried to install CUDA and cuDNN on Ubuntu Server 16.10 (I did not notice at first that it is not 16.04, but I don’t think that is the reason, because I installed it successfully the first time).

- Find a CUDA-capable card:
$ lspci | grep -i NVIDIA
90ff:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
- Install CUDA:
$ CUDA_REPO_PKG=cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ wget -O /tmp/${CUDA_REPO_PKG} http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} 
$ sudo dpkg -i /tmp/${CUDA_REPO_PKG}
$ rm -f /tmp/${CUDA_REPO_PKG}
$ sudo apt-get update
$ sudo apt-get install cuda-drivers
$ export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
$ reboot
- So far everything seems fine:
$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 90FF:00:00.0     Off |                    0 |
| N/A   43C    P0    62W / 149W |      0MiB / 11439MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

- Then I update packages:
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo apt-get dist-upgrade -y
$ sudo apt-get install cuda-drivers
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.51  Wed Mar 22 10:26:12 PDT 2017
GCC version:  gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) 
$ nvcc -V
The program 'nvcc' is currently not installed. You can install it by typing: sudo apt install nvidia-cuda-toolkit
- Install cuDNN:
$ wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v6/prod/8.0_20170307/cudnn-8.0-linux-x64-v6.0-tgz
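The wget above only downloads the archive; the cuDNN files still have to be copied into the CUDA tree. A rough sketch of the usual tarball install, assuming the download is the cudnn-8.0-linux-x64-v6.0.tgz archive and it extracts to a cuda/ directory:

$ tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*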

Now I want to check the GPU information:

$ lspci | grep -i nvidia
lspci: Cannot open /sys/bus/pci/devices/a28a:00:00.0/resource: No such file or directory

It said there is no GPU detected, which is impossible.

Which step caused me to fail to find my NVIDIA card while installing CUDA and cuDNN?

Maybe your K80 GPU is overheating.

You’re missing some steps in your CUDA install. I suggest reading the Linux install guide: you are supposed to install the full toolkit and set some environment variables, and then you will find nvcc.
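For example, a rough sketch of what that looks like with the ubuntu1604 repository you added earlier (the cuda meta-package pulls in the toolkit, whereas cuda-drivers installs only the driver):

$ sudo apt-get install cuda      # toolkit + driver meta-package, not just the driver
$ export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
$ nvcc -V                        # should now report release 8.0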

Once you get the lspci error, try running nvidia-smi again. If it cannot connect, then your K80 GPU may be overheating.
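If nvidia-smi does still respond, one thing you could check (my suggestion, not a definite diagnosis) is the temperature readout:

$ nvidia-smi -q -d TEMPERATURE   # shows the current GPU temperature and the slowdown/shutdown thresholds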

It’s also a little bit weird that only one K80 device is showing up in nvidia-smi. Normally two show up (the K80 is a dual-GPU board), unless you are using a cloud or VM instance.

Also, gcc 6.2 is not compatible with CUDA 8. If you use Ubuntu 16.04, you will get gcc 5.4, which works with the latest 8.0.61 CUDA release.
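If you want to stay on 16.10 rather than reinstalling, one possible workaround (untested here) is to install gcc-5 alongside gcc 6.2 and make it the default host compiler, for example:

$ sudo apt-get install gcc-5 g++-5
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50
$ gcc --version                  # should now report 5.x, which CUDA 8.0 accepts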

I am suffering the same problem.
lspci: Cannot open /sys/bus/pci/devices/38b3:00:00.0/resource: No such file or directory
The nvidia-smi command does not work either. Can you try typing nvidia-smi?

Yesterday I didn’t have this problem. At that time, I was able to detect the device and run TensorFlow.

Again, my guess would be an overheating K80. If your K80 is not installed in an OEM server that is certified and properly equipped for the K80, you are asking for trouble in my opinion.

It’s not possible to debug a problem with a report like “I am suffering the same problem.”

A lot more details than that are needed. And I’m not suggesting I would try to debug problems associated with running a K80 in a system that was not designed for it.

Thanks for replying.

I am using Ubuntu 16.04 server with one K80 card. Yesterday I was able to detect the GPU device and run TensorFlow, and nvidia-smi worked well. But now the GPU device cannot be detected.

The errors are shown as follows:

$ lspci
lspci: Cannot open /sys/bus/pci/devices/38b3:00:00.0/resource: No such file or directory

$ nvidia-smi
(the command hangs, i.e., it does not return anything)

$ deviceQuery
(the command hangs, i.e., it does not return anything)

Can you help me with this?

Thank you.

I am having the same problem.

I am using an Azure VM. Does a VM still have an overheating issue?

No, an Azure VM should not have an overheating issue.

I am using an Azure VM, too. Could it be a problem with PCI?

https://forums.gentoo.org/viewtopic-t-1010510-start-0.html

$ ls /sys/bus/pci/devices

0000:00:00.0 0000:00:07.1 0000:00:08.0
0000:00:07.0 0000:00:07.3 237738b3:00:00.0

No idea…

Have you solved that?

Maybe this is because I stopped (deallocated) the Azure VM and then started it again. According to [1], the hardware addresses (GPU, CPU, etc.) change when you stop (deallocate) and then start the VM again, but the Ubuntu system has not been updated with the new hardware addresses. Hence, lspci tells you it cannot open the folder for the old hardware address.

[1] Difference Between the States of Azure Virtual Machines: Stopped and Stopped (Deallocated) | Microsoft Docs
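One way to see what the kernel currently exposes (my addition): list the PCI addresses in sysfs and compare them with what lspci reports; the -D flag prints the full domain:bus:device.function.

$ ls /sys/bus/pci/devices        # PCI addresses the running kernel knows about
$ lspci -D                       # full domain:bus:device.function for each device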

It is possible. For me, the problem came after adding a new GPU card.

If you stop a VM, add hardware, and then start it, I would expect things to get messed up.

At a minimum, you should reboot the VM after adding hardware, and even that may be problematic.

Reboot does not help.

Then you may need to reload the OS after making hardware changes in the VM.

Have you solved the problem?

Instructions from Microsoft support solved the problem.

Here’s what they told me:

Canonical appears to have recently released kernel 4.4.0-75 for Ubuntu 16.04 and this is having an adverse effect on Tesla GPUs on NC-series VMs. Installation of the 4.4.0-75 breaks the 8.0.61-1 version of the NVIDIA CUDA driver that’s currently recommended for use on these systems, resulting in nvidia-smi not showing the adapters and lspci returning an error similar to the following:

~# lspci
lspci: Cannot open /sys/bus/pci/devices/XXXXXXXXX/resource: No such file or directory
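Before removing anything, it may be worth confirming (my suggestion, not part of the Microsoft instructions) that you are actually running the affected kernel and that another kernel image is installed to fall back on:

$ uname -r                         # check whether 4.4.0-75-generic is the running kernel
$ dpkg --list | grep linux-image   # verify another kernel image is installed to boot into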
They suggest backing up the OS drive, running

$ sudo apt-get remove linux-image-4.4.0-75-generic

and then

$ sudo update-grub

Reboot and it should work!
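After the reboot, a couple of quick checks (my addition, not part of the Microsoft instructions) to confirm the rollback worked:

$ uname -r      # should now report an earlier kernel, not 4.4.0-75-generic
$ nvidia-smi    # the Tesla K80 should be visible again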