I tried to install CUDA and cuDNN on Ubuntu Server 16.10 (I did not notice at first that it was not 16.04, but I don't think that is the reason, because I installed successfully the first time).
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 90FF:00:00.0     Off |                    0 |
| N/A   43C    P0    62W / 149W |      0MiB / 11439MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Then I updated the packages:
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo apt-get dist-upgrade -y
$ sudo apt-get install cuda-drivers
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.51 Wed Mar 22 10:26:12 PDT 2017
GCC version: gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)
$ nvcc -V
The program 'nvcc' is currently not installed. You can install it by typing: sudo apt install nvidia-cuda-toolkit
You're missing some steps in your CUDA installation. I suggest reading the Linux install guide. You are supposed to set some environment variables; once you do, you will find nvcc.
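For reference, the environment variables the Linux install guide asks for look roughly like this (a sketch assuming a default CUDA 8.0 install under /usr/local/cuda-8.0; adjust the path if yours differs):

```shell
# Put the CUDA toolkit binaries (including nvcc) on PATH and its shared
# libraries on the runtime linker path. Assumes a default CUDA 8.0 install.
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Add those two lines to ~/.bashrc so they survive a new login, and then `nvcc -V` should report the toolkit version.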
Once you get the lspci error, try running nvidia-smi again. If it cannot connect, then your K80 GPU may be overheating.
It's also a little bit weird that only one K80 device shows up in nvidia-smi. Normally there are two, unless you are using a cloud instance or VM instance.
Also, gcc 6.2 is not compatible with CUDA. If you use Ubuntu 16.04, you will get gcc 5.4, which will work with the latest 8.0.61 CUDA release.
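You can check which host compiler nvcc would pick up with a one-liner. If you want to stay on 16.10, installing gcc-5 and passing it to nvcc via `-ccbin gcc-5` is one possible workaround, though I haven't verified that combination on 16.10:

```shell
# Show the default host compiler version; CUDA 8.0 expects gcc 5.x or older.
gcc --version | head -n 1
```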
I am suffering the same problem.
lspci: Cannot open /sys/bus/pci/devices/38b3:00:00.0/resource: No such file or directory
The nvidia-smi command does not work either. Can you try typing nvidia-smi?
Yesterday I didn’t have this problem. At that time, I was able to detect the device and run tensorflow.
Again, my guess would be an overheating K80. If your K80 is not installed in an OEM server that is certified and properly equipped for the K80, you are asking for trouble in my opinion.
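If overheating is the suspicion, you can watch the temperature directly. The query flags below are standard nvidia-smi options; the command is guarded so it is harmless to paste on a machine where the driver is not loaded:

```shell
# Report GPU core temperature and power draw; sustained readings near the
# K80's thermal slowdown threshold point to a cooling problem.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv
fi
```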
It’s not possible to debug a problem with a report like “I am suffering the same problem.”
A lot more details than that are needed. And I’m not suggesting I would try to debug problems associated with running a K80 in a system that was not designed for it.
I am using Ubuntu 16.04 server with one K80 card. Yesterday I was able to detect the GPU device and run TensorFlow, and nvidia-smi worked well. But now the GPU device cannot be detected.
The errors are shown as follows:
$ lspci
lspci: Cannot open /sys/bus/pci/devices/38b3:00:00.0/resource: No such file or directory
$ nvidia-smi
(the terminal freezes, i.e., the command never returns)
$ deviceQuery
(the terminal freezes, i.e., the command never returns)
Maybe this is because I stopped (deallocated) the Azure VM and then started it again. According to [1], the hardware addresses (for devices like the GPU and CPU) change when you stop (deallocate) and then start the VM again, but the Ubuntu system has not been updated with the new hardware addresses. Hence lspci reports that it cannot open the folder tied to the old address.
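One quick way to check whether the bus ID really changed (a sketch; 10de is NVIDIA's PCI vendor ID, and the command is guarded so it is harmless on a machine without pciutils):

```shell
# List only NVIDIA PCI devices so the current bus ID can be compared
# against the stale 38b3:00:00.0 path in the lspci error above.
if command -v lspci >/dev/null 2>&1; then
  lspci -d 10de:
fi
```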
Instructions from Microsoft support solved the problem.
Here’s what they told me:
Canonical appears to have recently released kernel 4.4.0-75 for Ubuntu 16.04 and this is having an adverse effect on Tesla GPUs on NC-series VMs. Installation of the 4.4.0-75 breaks the 8.0.61-1 version of the NVIDIA CUDA driver that’s currently recommended for use on these systems, resulting in nvidia-smi not showing the adapters and lspci returning an error similar to the following:
~# lspci
lspci: Cannot open /sys/bus/pci/devices/XXXXXXXXX/resource: No such file or directory
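To check whether a VM is on the affected kernel, compare `uname -r` against the 4.4.0-75 build mentioned above (the apt-mark line is one possible stopgap of my own, not something Microsoft's instructions spell out):

```shell
# Print the running kernel release; 4.4.0-75 is the build that reportedly
# breaks the 8.0.61-1 NVIDIA driver on NC-series VMs.
uname -r
# One possible stopgap after rolling back: pin the kernel packages so an
# upgrade does not pull 4.4.0-75 back in.
#   sudo apt-mark hold linux-image-generic linux-headers-generic
```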
They suggest backing up the OS drive, running