NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

marvin.abisrror · August 10, 2019, 10:50pm

Hello everyone, I have been facing issues with my NVIDIA driver and CUDA version. I have tried many things I found on the internet, but nothing seems to help.

When I run:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here are all the steps I have taken, to give a full picture of what I am seeing.

$ sudo systemctl status nvidia-persistenced
[sudo] password for marvin:
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: inactive (dead)

$ sudo systemctl enable nvidia-persistenced
The unit files have no installation config (WantedBy, RequiredBy, Also, Alias
settings in the [Install] section, and DefaultInstance for template units).
This means they are not meant to be enabled using systemctl.
Possible reasons for having this kind of units are:

A unit may be statically enabled by being symlinked from another unit’s
.wants/ or .requires/ directory.
A unit’s purpose may be to act as a helper for some other unit which has
a requirement dependency on it.
A unit may be started when needed via activation (socket, path, timer,
D-Bus, udev, scripted systemctl call, …).
In case of template units, the unit is meant to be enabled with some
instance name specified.

I am not able to use the GPU in my computer (Precision 7730):

$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP104GLM [Quadro P5200 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

$ uname -m && cat /etc/*release
x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
…

$ gcc --version
gcc (Ubuntu 4.8.5-4ubuntu8) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ sudo ./NVIDIA-Linux-x86_64-430.40.run
…
1.- The distribution-provided pre-install script failed! Are you sure you want to continue?
→ Continue installation
2.- Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.
→ Yes
3.- Install NVIDIA’s 32-bit compatibility libraries?
ERROR: Failed to run /usr/sbin/dkms build -m nvidia -v 430.40 -k 5.0.0-23-generic:
Kernel preparation unnecessary for this kernel. Skipping…

     Building module:
     cleaning build area...
     'make' -j12 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.0.0-23-generic IGNORE_CC_MISMATCH='' modules...(bad exit status: 2)
     ERROR (dkms apport): binary package for nvidia: 430.40 not found
     Error! Bad return status for module build on kernel: 5.0.0-23-generic (x86_64)
     Consult /var/lib/dkms/nvidia/430.40/build/make.log for more information.

4.- ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
5.- ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

When I try to install CUDA with sudo sh cuda_10.1.168_418.67_linux.run, I get the following error:
[INFO]: /NVIDIA-Linux-x86_64-418.67/kernel/nvidia/nv_uvm_interface.o] Error 1

[INFO]: CC [M] /tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel/nvidia/linux_nvswitch.o

[INFO]: cc: error: unrecognized command line option ‘-fstack-protector-strong’

[INFO]: scripts/Makefile.build:284: recipe for target ‘/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel/nvidia/nvlink_linux.o’ failed

[INFO]: make[2]: *** [/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel/nvidia/nvlink_linux.o] Error 1

[INFO]: cc: error: unrecognized command line option ‘-fstack-protector-strong’

[INFO]: scripts/Makefile.build:284: recipe for target ‘/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel/nvidia/linux_nvswitch.o’ failed

[INFO]: make[2]: *** [/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel/nvidia/linux_nvswitch.o] Error 1

[INFO]: make[2]: Target ‘__build’ not remade because of errors.

[INFO]: Makefile:1606: recipe for target ‘_module_/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel’ failed

[INFO]: make[1]: *** [_module_/tmp/selfgz30574/NVIDIA-Linux-x86_64-418.67/kernel] Error 2

[INFO]: make[1]: Target ‘modules’ not remade because of errors.

[INFO]: make[1]: Leaving directory ‘/usr/src/linux-headers-5.0.0-23-generic’

[INFO]: Makefile:81: recipe for target ‘modules’ failed

[INFO]: make: *** [modules] Error 2

[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 418.67 failed, quitting

$ ubuntu-drivers list
nvidia-driver-410
nvidia-driver-390
nvidia-driver-415
nvidia-driver-430

Can anyone help me fix this?

Thanks.

generix · August 14, 2019, 9:42am

From the minimal info, I can only say that your system is quite broken.
You have
Ubuntu 18.04 + kernel 5.0 + gcc 4.8
which is quite a mismatch. How did you get into this state? You should consider a clean reinstall of Ubuntu, and don’t use any .run installer. Use the driver from the Ubuntu repo or the graphics-drivers PPA, then download and add the CUDA .deb. Do not install the cuda package; instead run
sudo apt install cuda-toolkit-10-1
so that the already installed driver is not overwritten.
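
The repeated `unrecognized command line option ‘-fstack-protector-strong’` errors in the log above point at the compiler part of this mismatch: that flag was added in GCC 4.9, so gcc 4.8 cannot compile a module against kernel headers that request it. A minimal Python sketch of a pre-flight check follows. The function names and the `MIN_GCC` constant are made up for illustration, and the sketch assumes `gcc --version` prints the version right after the parenthesised vendor string, as in the outputs quoted in this thread.

```python
import re
import subprocess

# -fstack-protector-strong is available from GCC 4.9 onwards.
MIN_GCC = (4, 9)


def parse_gcc_version(first_line):
    """Return (major, minor) from the first line of `gcc --version`."""
    # Assumes the form "gcc (<vendor string>) X.Y[.Z] ...".
    match = re.search(r"\)\s+(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("unrecognised gcc version line: %r" % first_line)
    return int(match.group(1)), int(match.group(2))


def gcc_supports_strong_protector(first_line, minimum=MIN_GCC):
    """True if the reported gcc is new enough for -fstack-protector-strong."""
    return parse_gcc_version(first_line) >= minimum


if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
        first = output.splitlines()[0]
        print(first, "->", "ok" if gcc_supports_strong_protector(first) else "too old")
    except (OSError, subprocess.CalledProcessError, IndexError, ValueError) as exc:
        print("could not determine gcc version:", exc)
```

This only checks the compiler version. It says nothing about whether the kernel headers or the driver release are otherwise compatible.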

marvin.abisrror · August 16, 2019, 5:01pm

Hello generix,

I have updated the gcc compiler from 4.8 to 5.5:

$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1) 5.5.0 20171010
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Then I ran:
$ sudo sh cuda_10.1.168_418.67_linux.run

= Summary =

Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Installed in /home/marvin/

Please make sure that

PATH includes /usr/local/cuda-10.1/bin
LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log
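
The two environment requirements in the installer summary can be verified with a short Python sketch. The helper name `missing_cuda_entries` is hypothetical; the directories are the ones the installer printed above. It does not cover the alternative of adding the lib64 directory to /etc/ld.so.conf.

```python
import os

# Install location reported in the installer summary.
CUDA_HOME = "/usr/local/cuda-10.1"


def missing_cuda_entries(environ):
    """Return (variable, directory) pairs the installer asks for but that are absent."""
    required = {
        "PATH": CUDA_HOME + "/bin",
        "LD_LIBRARY_PATH": CUDA_HOME + "/lib64",
    }
    missing = []
    for var, directory in required.items():
        # Both variables are colon-separated lists on Linux.
        entries = [e.rstrip("/") for e in environ.get(var, "").split(":") if e]
        if directory not in entries:
            missing.append((var, directory))
    return missing


if __name__ == "__main__":
    problems = missing_cuda_entries(os.environ)
    for var, directory in problems:
        print(var + " does not include " + directory)
    if not problems:
        print("PATH and LD_LIBRARY_PATH both include the CUDA 10.1 directories")
```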

And now nvidia-smi works.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I am now able to use the GPU, and no reinstall was needed. However, I remember that this GPU used to work and somehow broke when the NVIDIA driver was updated from 418.x to 430.x.
I am not sure whether this error will happen again when a new update comes out.

In [1]: import torch

In [2]: torch.cuda.is_available()
Out[2]: True
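
To notice quickly whether a later driver or kernel update has broken things again, one option is to check that the `nvidia` kernel module is actually loaded, since the nvidia-smi error at the top of this thread typically appears when it is not. A minimal sketch, assuming a Linux system where /proc/modules lists one loaded module per line with the module name as the first field (the function name is made up for illustration):

```python
def nvidia_module_loaded(proc_modules_text):
    """True if a module named exactly 'nvidia' appears in /proc/modules text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # First field of each line is the module name.
            names.add(fields[0])
    return "nvidia" in names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            text = handle.read()
    except OSError:
        text = ""
    if nvidia_module_loaded(text):
        print("nvidia kernel module is loaded")
    else:
        print("nvidia kernel module is NOT loaded; check the DKMS build logs")
```

A loaded module does not by itself prove CUDA works, so the torch.cuda.is_available() check above is still worth running afterwards.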

Thank you!