cuda install fail - ubuntu 14.04

I am having some trouble installing CUDA support on an IBM SoftLayer machine with an NVIDIA K80.
Since we’d like to get a series of these up, we need to either fix this, get a different GPU, or move from SoftLayer to another host.

install steps i took:

  1. preinstall -
lspci|grep -i nvidia
83:00.0 3D controller: NVIDIA Corporation Device 102d (rev a1)
84:00.0 3D controller: NVIDIA Corporation Device 102d (rev a1)

root@brain2:~#  uname -m && cat /etc/*release
x86_64
...
DISTRIB_DESCRIPTION="Ubuntu 14.04.3 LTS"

root@brain2:~# gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4

didn’t verify the download, since no checksum for cuda_7.5.18_linux.run is listed at https://developer.nvidia.com/cuda-downloads/checksums
(and no file sizes are listed for the entries that are there, btw)

  2. downloaded runfile
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
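
The skipped checksum step can still be done once a reference value turns up. A minimal sketch, assuming an MD5 value for the runfile has been obtained from NVIDIA; `verify_md5` is a hypothetical helper name and no real checksum is given here.

```shell
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper: returns 0 when the MD5 of FILE equals EXPECTED_MD5.
verify_md5() {
    # md5sum prints "<hash>  <file>"; keep only the hash
    actual=$(md5sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}
```

It would be called as `verify_md5 cuda_7.5.18_linux.run <md5 published by NVIDIA>`.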

disabled nouveau drivers

root@brain2:~# more /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

didn’t reboot since i only have cmdline access to machine anyway

  3. ran runfile
chmod +x cuda_7.5.18_linux.run 
 sudo sh cuda_7.5.18_linux.run
  4. noticed missing libs for the examples, tried to install them, gave up:
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

 freeglut3-dev : Depends: libgl1-mesa-dev but it is not going to be installed or
                          libgl-dev
 libglu1-mesa-dev : Depends: libglu1-mesa (= 8.0.2-0ubuntu3) but 9.0.0-2 is to be installed
                    Depends: libgl1-mesa-dev but it is not going to be installed or
                             libgl-dev
 libxmu-dev : Depends: libxmu6 (= 2:1.1.0-3) but 2:1.1.1-1 is to be installed

try adding sources

echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
more /etc/apt/sources.list
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

no dice, give up on examples
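
Two things stand out in the sources step: Ubuntu 14.04 is codenamed trusty while `precise` is 12.04, and `>` replaces the whole sources.list rather than adding to it. A sketch that appends entries for a given release and keeps a backup; `add_ubuntu_sources` is a hypothetical helper, and the component list (main universe) is simply carried over from the line used above.

```shell
# add_ubuntu_sources CODENAME FILE
# Hypothetical helper: appends archive entries for CODENAME to FILE,
# after saving a copy of the existing file as FILE.bak.
add_ubuntu_sources() {
    codename=$1
    file=$2
    [ -f "$file" ] && cp "$file" "$file.bak"
    for pocket in "$codename" "$codename-updates"; do
        echo "deb http://archive.ubuntu.com/ubuntu $pocket main universe"
    done >> "$file"   # append; never truncate the existing list
}
```

As root, something like `add_ubuntu_sources "$(lsb_release -cs)" /etc/apt/sources.list` followed by `apt-get update` would apply it to the running release.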

  5. reboot

  6. verify device nodes - FAIL! No /dev/nvidia* exists. Tried nvidia-smi; that command doesn’t succeed either:

root@brain2:~# nvidia-smi         
modprobe: FATAL: Module nvidia not found.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
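
"Module nvidia not found" means modprobe sees no nvidia module in the module tree of the running kernel, so it is worth checking whether the installer actually left one there. A rough sketch; `find_nvidia_modules` is a hypothetical helper, and the file name pattern is an assumption about how the driver names its modules.

```shell
# find_nvidia_modules [MODULE_DIR]
# Hypothetical helper: lists nvidia*.ko files under the running kernel's
# module tree (or MODULE_DIR when given). Returns 1 when none are found.
find_nvidia_modules() {
    dir=${1:-/lib/modules/$(uname -r)}
    # nvidiafb.ko is the in-kernel framebuffer driver, not the CUDA driver,
    # so it is excluded from the match
    found=$(find "$dir" -name 'nvidia*.ko' ! -name 'nvidiafb.ko' 2>/dev/null)
    if [ -n "$found" ]; then
        echo "$found"
    else
        echo "no nvidia*.ko under $dir" >&2
        return 1
    fi
}
```

An empty result for the running kernel would suggest the kernel module build or install step failed, even if the installer summary says "Driver: Installed".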

fwiw, modprobe nvidiafb does work:

root@brain2:~# modprobe nvidiafb -vvvv
modprobe: INFO: ../libkmod/libkmod.c:354 kmod_set_log_fn() custom logging function 0x7fcd9d64b090 registered
modprobe: DEBUG: ../libkmod/libkmod-index.c:790 index_mm_open() file=/lib/modules/3.13.0-74-generic/modules.dep.bin
modprobe: DEBUG: ../libkmod/libkmod-index.c:790 index_mm_open() file=/lib/modules/3.13.0-74-generic/modules.alias.bin
modprobe: DEBUG: ../libkmod/libkmod-index.c:790 index_mm_open() file=/lib/modules/3.13.0-74-generic/modules.symbols.bin
modprobe: DEBUG: ../libkmod/libkmod-index.c:790 index_mm_open() file=/lib/modules/3.13.0-74-generic/modules.builtin.bin
modprobe: DEBUG: ../libkmod/libkmod-module.c:529 kmod_module_new_from_lookup() input alias=nvidiafb, normalized=nvidiafb
modprobe: DEBUG: ../libkmod/libkmod-module.c:535 kmod_module_new_from_lookup() lookup modules.dep nvidiafb
modprobe: DEBUG: ../libkmod/libkmod.c:544 kmod_search_moddep() use mmaped index 'modules.dep' modname=nvidiafb
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='nvidiafb' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:400 kmod_pool_add_module() add 0x7fcd9e7e2760 key='nvidiafb'
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='vgastate' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='vgastate' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:400 kmod_pool_add_module() add 0x7fcd9e7e6540 key='vgastate'
modprobe: DEBUG: ../libkmod/libkmod-module.c:184 kmod_module_parse_depline() add dep: /lib/modules/3.13.0-74-generic/kernel/drivers/video/vgastate.ko
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='fb_ddc' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='fb_ddc' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:400 kmod_pool_add_module() add 0x7fcd9e7e26f0 key='fb_ddc'
modprobe: DEBUG: ../libkmod/libkmod-module.c:184 kmod_module_parse_depline() add dep: /lib/modules/3.13.0-74-generic/kernel/drivers/video/fb_ddc.ko
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='i2c_algo_bit' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:392 kmod_pool_get_module() get module name='i2c_algo_bit' found=(nil)
modprobe: DEBUG: ../libkmod/libkmod.c:400 kmod_pool_add_module() add 0x7fcd9e7e6810 key='i2c_algo_bit'
modprobe: DEBUG: ../libkmod/libkmod-module.c:184 kmod_module_parse_depline() add dep: /lib/modules/3.13.0-74-generic/kernel/drivers/i2c/algos/i2c-algo-bit.ko
modprobe: DEBUG: ../libkmod/libkmod-module.c:190 kmod_module_parse_depline() 3 dependencies for nvidiafb
modprobe: DEBUG: ../libkmod/libkmod-module.c:556 kmod_module_new_from_lookup() lookup nvidiafb=0, list=0x7fcd9e7e26d0
modprobe: DEBUG: ../libkmod/libkmod-module.c:441 kmod_module_unref() kmod_module 0x7fcd9e7e2760 released
modprobe: DEBUG: ../libkmod/libkmod.c:408 kmod_pool_del_module() del 0x7fcd9e7e2760 key='nvidiafb'
modprobe: DEBUG: ../libkmod/libkmod-module.c:441 kmod_module_unref() kmod_module 0x7fcd9e7e6810 released
modprobe: DEBUG: ../libkmod/libkmod.c:408 kmod_pool_del_module() del 0x7fcd9e7e6810 key='i2c_algo_bit'
modprobe: DEBUG: ../libkmod/libkmod-module.c:441 kmod_module_unref() kmod_module 0x7fcd9e7e26f0 released
modprobe: DEBUG: ../libkmod/libkmod.c:408 kmod_pool_del_module() del 0x7fcd9e7e26f0 key='fb_ddc'
modprobe: DEBUG: ../libkmod/libkmod-module.c:441 kmod_module_unref() kmod_module 0x7fcd9e7e6540 released
modprobe: DEBUG: ../libkmod/libkmod.c:408 kmod_pool_del_module() del 0x7fcd9e7e6540 key='vgastate'
modprobe: INFO: ../libkmod/libkmod.c:321 kmod_unref() context 0x7fcd9e7e22e0 released

installed the boot script listed under ‘device node verification’ here http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#runfile-verifications , which fails as above.

  7. changed grub to do a text-only boot (though that happens anyway with these remote machines), rebooted, same story - no /dev/nvidia*

  8. re-ran runfile; result, as before, is

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-7.5
Samples:  Installed in /root, but missing recommended libraries
  9. reboot, still no /dev/nvidia*, and nvidia-smi still fails to communicate with the driver…

Did you start with a clean OS install? Or were there any attempts at GPU SW install prior to what you show here?

I think your process for disabling nouveau is incomplete. Follow the instructions in the linux install guide:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau

After removing nouveau from the initrd image, it’s necessary to reboot.

Upon reboot, run nvidia-smi as root.
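
The nouveau steps can be sketched roughly as follows. The two blacklist lines and the file path are the ones shown earlier in the thread; `write_nouveau_blacklist` and `nouveau_loaded` are hypothetical helper names.

```shell
# write_nouveau_blacklist [FILE]
# Writes the two blacklist lines; FILE defaults to the path used in this thread.
write_nouveau_blacklist() {
    file=${1:-/etc/modprobe.d/blacklist-nouveau.conf}
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$file"
}

# nouveau_loaded [MODULES_FILE]
# Returns 0 if nouveau is in the loaded-module list (default: /proc/modules).
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}
```

As root, the sequence would be `write_nouveau_blacklist`, then `update-initramfs -u`, then a reboot; afterwards `nouveau_loaded && echo "nouveau is still loaded"` gives a quick check before running nvidia-smi.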

thanks for your reply

Yes this was a clean ubuntu install other than matlab and opencv - no gpu install attempts for either.

Although I have a text-only interface in any case, I redid the full nouveau disable on the same hunch you had. (I had run sudo update-initramfs -u the first time around as well; I forgot to list it in the steps above.)

root@brain2:~# lsmod|grep nouveau
root@brain2:~#

Rebooted, still no love. Redid the runfile install, no love.

Tried the package-manager approach:

root@brain2:/home/jeremy# sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb 
(Reading database ... 117312 files and directories currently installed.)
Preparing to unpack cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb ...
Unpacking cuda-repo-ubuntu1404-7-5-local (7.5-18) over (7.5-18) ...
Setting up cuda-repo-ubuntu1404-7-5-local (7.5-18) ...
OK
root@brain2:/home/jeremy# sudo apt-get update 
Ign file:  InRelease
Get:1 file:  Release.gpg [181 B]
Get:2 file:  Release [196 B]                                       
Ign file:  Translation-en_US                                   
Ign file:  Translation-en                                 
Ign http://archive.ubuntu.com precise InRelease           
Hit http://archive.ubuntu.com precise Release.gpg
Hit http://archive.ubuntu.com precise Release
Hit http://archive.ubuntu.com precise/main amd64 Packages
Hit http://archive.ubuntu.com precise/universe amd64 Packages
Hit http://archive.ubuntu.com precise/main i386 Packages
Hit http://archive.ubuntu.com precise/universe i386 Packages
Hit http://archive.ubuntu.com precise/main Translation-en
Hit http://archive.ubuntu.com precise/universe Translation-en
Ign http://archive.ubuntu.com precise/main Translation-en_US
Ign http://archive.ubuntu.com precise/universe Translation-en_US
Reading package lists... Done 
root@brain2:/home/jeremy# sudo apt-get install cuda 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-7-5 (= 7.5-18) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

apparently the standard fix for this doesn’t work:

root@brain2:/home/jeremy# sudo apt-get -f install
Reading package lists... Done
Building dependency tree       
Reading state information... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
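
One thing visible in the apt-get update output is that apt is only reading `precise` (12.04) indexes while the machine runs 14.04.3 (trusty), which may be why the cuda dependencies cannot be resolved. A sketch that flags Ubuntu archive entries for a different release; `mismatched_sources` is a hypothetical helper, and it assumes any `[options]` field in a sources line contains no spaces.

```shell
# mismatched_sources CODENAME FILE...
# Hypothetical helper: prints deb/deb-src lines for ubuntu.com archives
# whose suite does not start with CODENAME.
mismatched_sources() {
    codename=$1
    shift
    awk -v c="$codename" '
        $1 == "deb" || $1 == "deb-src" {
            i = ($2 ~ /^\[/) ? 3 : 2      # URL field, after optional [options]
            if ($i ~ /ubuntu\.com/ && index($(i + 1), c) != 1)
                print FILENAME ": " $0
        }' "$@"
}
```

For example, `mismatched_sources "$(lsb_release -cs)" /etc/apt/sources.list` would list any entries that do not belong to the running release; local repositories such as the CUDA file: repo are ignored.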

so… anybody have a clue for me? Maybe I have to uninstall the remnants of the runfile attempt first, then try the .deb install?

OK, after uninstalling (sh …deb -silent --uninstall, and running the uninstaller from …/cuda7.5/bin), trying the .deb install again led to the same failure.
So I tried the instructions here for GPU on Ubuntu using the CUDA 7.0 runfile:

and hit the following:

ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 346.46 -k 3.13.0-74-generic`:
         Kernel preparation unnecessary for this kernel.  Skipping...
  
         Building module:                                                                                                                                                                                               
         cleaning build area....                                         
         make KERNELRELEASE=3.13.0-74-generic module KERNEL_UNAME=3.13.0-74-generic; make -C uvm module KERNEL_UNAME=3.13.0-74-generic KBUILD_EXTMOD=/var/lib/dkms/nvidia/346.46/build/uvm.......(bad exit status: 2)   
         Error! Bad return status for module build on kernel: 3.13.0-74-generic (x86_64)
         Consult /var/lib/dkms/nvidia/346.46/build/make.log for more information.

Tried again without dkms; this time I get
ERROR: Unable to build the NVIDIA kernel module.
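
The DKMS message points at make.log for the real cause, and the installer writes its own log as well. A small sketch for pulling the first error lines out of such a log; `build_errors` is a hypothetical helper, and matching on the word "error" is only a heuristic.

```shell
# build_errors LOGFILE [MAX_LINES]
# Hypothetical helper: prints the first lines of a build log that mention
# "error" (case-insensitive), prefixed with their line numbers.
build_errors() {
    grep -n -i 'error' "$1" | head -n "${2:-20}"
}
```

Here it would be run as `build_errors /var/lib/dkms/nvidia/346.46/build/make.log`, and the same call works on /var/log/nvidia-installer.log.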

maybe this helps diagnose:

root@brain2:/home/jeremy# apt-cache policy cuda
cuda:
  Installed: (none)
  Candidate: 7.5-18
  Version table:
     7.5-18 0
        500 file:/var/cuda-repo-7-5-local/  Packages

anyone with any clues? I am still in the same state - after several rounds of uninstall/reinstall I still get

root@brain2:~# ./cuda_7.5.18_linux.run

= Summary =

Driver: Installed
Toolkit: Installed in /usr/local/cuda-7.5
Samples: Installed in /root

root@brain2:~# nvidia-smi
modprobe: FATAL: Module nvidia not found.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I tried another route, this time using a PPA repository:

sudo add-apt-repository ppa:xorg-edgers/ppa
apt-get install nvidia-current
apt-get install nvidia-current-updates

and now get from deviceQuery

root@brain2:/usr/local/cuda/samples/1_Utilities/deviceQuery# ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

which actually looks like progress since the device is found at least.
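
That message says the loaded kernel driver is older than the CUDA runtime expects, so comparing the loaded driver version with the one the toolkit needs is a reasonable next check. A sketch, assuming the driver reports itself in /proc/driver/nvidia/version with an "NVRM version: ... Kernel Module  <version>" line; `driver_version` and `version_ge` are hypothetical helpers, and the minimum version is left to the caller.

```shell
# driver_version [VERSION_FILE]
# Prints the driver version found on the "NVRM version" line.
driver_version() {
    sed -n 's/^NVRM version:.*Kernel Module *\([0-9][0-9.]*\).*/\1/p' \
        "${1:-/proc/driver/nvidia/version}"
}

# version_ge A B
# Returns 0 when dotted version A is >= B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A check could then read `version_ge "$(driver_version)" 352.39 || echo "driver too old"`, where 352.39 is an assumed minimum for CUDA 7.5 that should be confirmed against the release notes.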

As yet another route, I tried downloading the driver runfile after once again uninstalling.

root@brain2:~# ./NVIDIA-Linux-x86_64-352.79.run

It hits ‘error: unable to build the nvidia kernel module’. I looked at the error log:

root@brain2:~# more /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Feb  4 11:40:23 2016
installer version: 352.79

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda-7.5/bin

nvidia-installer command line:
    ./nvidia-installer

Using: nvidia-installer ncurses user interface
-> Detected 48 CPUs online; setting concurrency level to 48.
-> License accepted.
-> Installing NVIDIA driver version 352.79.
-> There appears to already be a driver installed on your system (version: 352.79).  As part of installing this driver (version: 352.79), the existing driver will be uninstalled.  Are you sure you want to continue? (Answer: Continue installation)
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: No)
-> Performing CC sanity check with CC="/usr/bin/cc".
-> Kernel source path: '/lib/modules/3.13.0-74-generic/build'
-> Kernel output path: '/lib/modules/3.13.0-74-generic/build'
-> Performing rivafb check.
-> Performing nvidiafb check.
-> Performing Xen check.
-> Performing PREEMPT_RT check.
-> Cleaning kernel module build directory.
   executing: 'cd ./kernel; /usr/bin/make clean'...
-> Building NVIDIA kernel module:
   executing: 'cd ./kernel; /usr/bin/make module SYSSRC=/lib/modules/3.13.0-74-generic/build SYSOUT=/lib/modules/3.13.0-74-generic/build -j48  NV_BUILD_MODULE_INSTANCES='...
   NVIDIA: calling KBUILD...
   make[1]: Entering directory `/usr/src/linux-headers-3.13.0-74-generic'
   test -e include/generated/autoconf.h -a -e include/config/auto.conf || (		\
   	echo >&2;							\
   	echo >&2 "  ERROR: Kernel configuration is invalid.";		\
   	echo >&2 "         include/generated/autoconf.h or include/config/auto.conf are missing.";\
   	echo >&2 "         Run 'make oldconfig && make prepare' on kernel src to fix it.";	\
   	echo >&2 ;							\
   	/bin/false)

etc.
anyone have any knowledge about this?
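
The log excerpt shows the kernel build testing for two generated files in the headers directory; whether that test actually failed is cut off by the "etc.", so it is worth checking directly. A sketch; `check_kernel_headers` is a hypothetical helper, and the default path is the kernel source path the installer reported for the running kernel.

```shell
# check_kernel_headers [BUILD_DIR]
# Hypothetical helper: reports whether the two files the kernel build tests
# for exist. Returns 1 if either is missing.
check_kernel_headers() {
    dir=${1:-/lib/modules/$(uname -r)/build}
    status=0
    for f in include/generated/autoconf.h include/config/auto.conf; do
        if [ -e "$dir/$f" ]; then
            echo "present: $dir/$f"
        else
            echo "missing: $dir/$f"
            status=1
        fi
    done
    return $status
}
```

If either file is missing, reinstalling the headers package with `sudo apt-get install --reinstall linux-headers-$(uname -r)` might restore it, provided apt is pointed at the repositories for the running release.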