Install CUDA drivers, toolkit and SDK for stateless nodes in a cluster (nonstandard install)

Hi,

I wonder if some experts on CUDA installation (who know what happens during the process, step by step) could help. I have to install CUDA for a cluster of GPU-equipped nodes that boot not from local hard disks but over the network from a central server, from an image file prepared in a chrooted environment on that server. The network booting is done by Warewulf and works perfectly. Now I have to find a way to install the CUDA driver and the other CUDA software into this image so that, once the nodes have booted, everything works. Unfortunately, it is not as simple as running the downloaded NVIDIA installers…

I think I have made significant progress on this; in fact, I would expect it to work by now, but unfortunately it does not. The present state is that when I run deviceQuery, I get the error message “CUDA driver version is insufficient for CUDA runtime version”. I hope you can help me find out what I did wrong.

Let me sum up what I’ve done:

  1. Kernel driver install:

Since there is no CUDA-capable GPU in the server (and I can’t insert one, since there is no x16 PCIe slot), even compiling the kernel driver needed some tricks: the standard install mode fails, because it aborts when loading the driver is unsuccessful. I downloaded “devdriver_4.0_linux_64_270.41.19.run” from NVIDIA, extracted its contents with the “-x” switch, went into the “kernel” directory inside it, and issued “make module; make module-install”. This built the “nvidia.ko” kernel module and put it in the appropriate place under “/lib/modules/…/video/”, from where I copied it into the image file of the nodes. The kernel version on the nodes is exactly the same as the kernel version on the server.
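Roughly, the commands on the server were along these lines (the extracted directory name and the node chroot path below are only placeholders for my actual paths; the module directory depends on the kernel version):

# sh devdriver_4.0_linux_64_270.41.19.run -x
# cd NVIDIA-Linux-x86_64-270.41.19/kernel          # extracted driver tree (name is a placeholder)
# make module
# make module-install                              # puts nvidia.ko under /lib/modules/<kernel>/kernel/drivers/video/
# cp /lib/modules/2.6.32-131.21.1.el6.x86_64/kernel/drivers/video/nvidia.ko \
     /path/to/node-chroot/lib/modules/2.6.32-131.21.1.el6.x86_64/kernel/drivers/video/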

I think everything is fine up to this point, since when the nodes boot, the driver is loaded with no errors visible in dmesg.

  2. CUDA libraries

The directory extracted from “devdriver_4.0_linux_64_270.41.19.run” in the previous step contained the already compiled *.so.270.41.19 files. Judging from the contents of some nvidia lib RPM packages prepackaged by volunteers, these seem to be simply copied into /usr/lib64, so that is what I did (roughly as sketched below).
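Something like this, done on the server against the node image (the chroot path is a placeholder; I believe the final ldconfig step, which the packaged installers would normally do, is needed so that the .so.1 links such as libcuda.so.1 get created from the SONAMEs):

# cd NVIDIA-Linux-x86_64-270.41.19                 # extracted driver tree (name is a placeholder)
# cp *.so.270.41.19 /path/to/node-chroot/usr/lib64/
# chroot /path/to/node-chroot /sbin/ldconfig       # should (re)create e.g. libcuda.so.1 -> libcuda.so.270.41.19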

I’m not absolutely sure that I’m correct at this point. Can someone confirm it?

  3. CUDA toolkit install

I downloaded and ran the “cudatoolkit_4.0.17_linux_64_rhel6.0.run” package. As the destination directory I chose my home directory rather than /usr/local, because the home directory is visible on (NFS-mounted by) all nodes. Steps 1 and 2 were done as root; from this point on I did everything as a regular user, but I don’t think that should matter, as long as I’m the only user.

I think this step is OK as well; the installation completed without errors. Afterwards I set my “LD_LIBRARY_PATH” to the respective “/home/…/lib64” and “/home/…/lib” directories, as the installer asked.
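Something like the following (the prefix under my home directory is the one that shows up in the LD_LIBRARY_PATH output further down; adding the bin directory to PATH is also what the installer suggests):

$ sh cudatoolkit_4.0.17_linux_64_rhel6.0.run       # install prefix: /home/pusztai/cuda
$ export PATH=/home/pusztai/cuda/bin:$PATH
$ export LD_LIBRARY_PATH=/home/pusztai/cuda/lib64:/home/pusztai/cuda/lib:$LD_LIBRARY_PATH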

  4. CUDA SDK install

I downloaded and ran the “gpucomputingsdk_4.0.17_linux.run” package as a normal user and installed it into my home directory, which is its default. After installing some dependencies that turned out to be needed by the installer, the process was successful and all the sample codes compiled. Again, since the home directory is shared, it is visible to the nodes.
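For completeness, this was roughly:

$ sh gpucomputingsdk_4.0.17_linux.run              # default prefix: ~/NVIDIA_GPU_Computing_SDK
$ cd ~/NVIDIA_GPU_Computing_SDK/C
$ make                                             # builds the samples, including deviceQuery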

  5. Read. Try.

I guess that’s all I had to do, so I checked whether it works. (Of course, it doesn’t :-( )

I booted a node and checked that the kernel module is loaded:

[pusztai@gpu01 ~]$ lsmod |grep nvidia
nvidia              10713027  0
i2c_core               31274  1 nvidia

Since no X is running on the nodes, I used the short script from the NVIDIA PDF guide to create the device nodes (a sketch of it follows the listing below). I checked that they are there, with the correct permissions:

[pusztai@gpu01 ~]$ ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Dec 16 00:05 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Dec 16 00:05 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 16 00:05 /dev/nvidiactl
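For reference, the script from the guide is roughly the following (quoted from memory, so treat it as a sketch): it loads the module, counts the NVIDIA controllers on the PCI bus, and creates the device nodes with mknod.

#!/bin/bash
# Load the nvidia module and create the /dev/nvidia* device nodes
# (adapted from the NVIDIA getting-started guide; sketch only).
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the NVIDIA controllers found on the PCI bus.
  N3D=$(lspci | grep -i NVIDIA | grep -c "3D controller")
  NVGA=$(lspci | grep -i NVIDIA | grep -c "VGA compatible controller")
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 "$N"); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi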

I also checked the kernel version, etc.:

[pusztai@gpu01 ~]$ dmesg |tail -8
nvidia 0000:03:00.0: PCI INT A disabled
nvidia 0000:04:00.0: PCI INT A disabled
nvidia 0000:03:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
nvidia 0000:03:00.0: setting latency timer to 64
vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=none,decodes=none:owns=io+mem
nvidia 0000:04:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
nvidia 0000:04:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  270.41.19  Mon May 16 23:32:08 PDT 2011

[pusztai@gpu01 ~]$ cat /proc/version
Linux version 2.6.32-131.21.1.el6.x86_64 (mockbuild@c6b6.bsys.dev.centos.org) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Tue Nov 22 19:48:09 GMT 2011

[pusztai@gpu01 ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  270.41.19  Mon May 16 23:32:08 PDT 2011
GCC version:  gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC)

I checked that the NVIDIA libraries are in /usr/lib64:

[pusztai@gpu01 ~]$ ls -al /usr/lib64/*.270.41.19
-rwxr-xr-x 1 root root  1008272 Dec 15 18:59 /usr/lib64/libGL.so.270.41.19
-rwxr-xr-x 1 root root   155544 Dec 15 18:59 /usr/lib64/libXvMCNVIDIA.so.270.41.19
-rwxr-xr-x 1 root root  9259326 Dec 15 18:59 /usr/lib64/libcuda.so.270.41.19
-rwxr-xr-x 1 root root  6327720 Dec 15 18:59 /usr/lib64/libglx.so.270.41.19
-rwxr-xr-x 1 root root  2042224 Dec 15 18:59 /usr/lib64/libnvcuvid.so.270.41.19
-rwxr-xr-x 1 root root   133064 Dec 15 18:59 /usr/lib64/libnvidia-cfg.so.270.41.19
-rwxr-xr-x 1 root root 20498976 Dec 15 18:59 /usr/lib64/libnvidia-compiler.so.270.41.19
-rwxr-xr-x 1 root root 27484752 Dec 15 18:59 /usr/lib64/libnvidia-glcore.so.270.41.19
-rwxr-xr-x 1 root root    85464 Dec 15 18:59 /usr/lib64/libnvidia-ml.so.270.41.19
-rwxr-xr-x 1 root root     6008 Dec 15 18:59 /usr/lib64/libnvidia-tls.so.270.41.19
-r-xr-xr-x 1 root root   295416 Dec 15 18:59 /usr/lib64/libnvidia-wfb.so.270.41.19
-rwxr-xr-x 1 root root     4064 Dec 15 18:59 /usr/lib64/libvdpau.so.270.41.19
-rw-r--r-- 1 root root  1656744 Dec 15 18:59 /usr/lib64/libvdpau_nvidia.so.270.41.19
-rwxr-xr-x 1 root root    46872 Dec 15 18:59 /usr/lib64/libvdpau_trace.so.270.41.19

I checked that my LD_LIBRARY_PATH is correct:

[pusztai@gpu01 ~]$ echo $LD_LIBRARY_PATH
/home/pusztai/cuda/lib64/:/home/pusztai/cuda/lib

So it seems that everything is OK. But running deviceQuery fails:

[pusztai@gpu01 ~]$ /home/pusztai/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
[deviceQuery] starting...
/home/pusztai/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
[deviceQuery] test results...
FAILED
Press ENTER to exit...

Thanks for your patience in reading this extra-long post; I just wanted to provide all the details of what I did.

I’m stuck here. I wonder if any of you have ideas or hints on how to proceed.

Thanks,

drBubo

Hi drBubo,

If I follow you, you installed the driver module into the node image, booted the node, and did not see any devices. Every system has its own nuances, so I won’t claim this is a fix, but if you can, I would recommend logging into a running node and installing the driver on the running image. Then, if your cluster manager allows it, propagate that image to the other nodes. Try using nvidia-smi, which is part of the driver installation, to view and change the properties of the available devices. If that doesn’t work, the rest won’t either…
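For example, just to see whether the driver can talk to the GPUs at all, something along these lines on a node:

$ nvidia-smi           # should list the GPUs the driver can see
$ nvidia-smi -q        # full per-device query output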

Hope this is helpful

Hi drBubo, can you tell me how to extract the contents using the “-x” switch? I need the CUDA libraries to compile my code on my laptop, which has no GPU.

Thanks
