IBM Power8: CUDA driver version is insufficient for CUDA runtime version

Hi,

Here is the issue that I have been facing. I have an S822LC system with 2-socket (10-core) POWER8 processors and 4x Tesla P100 cards. I have installed Ubuntu 16.04 on the system and am now trying to install CUDA 8.0 and NVIDIA driver 361.93.03 for the Tesla P100 cards. I get the following error when I try to run bandwidthTest from the CUDA samples:

cudaGetDeviceProperties returned 35
-> CUDA driver version is insufficient for CUDA runtime version
CUDA error at bandwidthTest.cu:242 code=35(cudaErrorInsufficientDriver) "cudaSetDevice(currentDevice)"

Here is the output of the nvidia-smi command:

Mon Nov 28 17:04:35 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.93.03              Driver Version: 361.93.03                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0002:01:00.0     Off |                    0 |
| N/A   42C    P0    39W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0003:01:00.0     Off |                    0 |
| N/A   35C    P0    37W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 000A:01:00.0     Off |                    0 |
| N/A   34C    P0    36W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 000B:01:00.0     Off |                    0 |
| N/A   34C    P0    37W / 300W |      0MiB / 16280MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi detects the 4x Tesla P100 cards, but I still get the "CUDA driver version is insufficient for CUDA runtime version" error when I try to run the CUDA samples. Would you happen to know the right combination of CUDA and NVIDIA driver versions for this setup?

Thanks
Shivi

Did you get the 361.93.03 driver from here:

http://www.nvidia.com/download/driverResults.aspx/110987/en-us

Did you get the CUDA 8 toolkit from here:

https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb

(i.e. from here: https://developer.nvidia.com/cuda-downloads)

If so, my suspicion is a runfile installer/deb installer clash. The potential for this is outlined in the Linux getting started guide:

https://developer.nvidia.com/compute/cuda/8.0/prod/docs/sidebar/CUDA_Installation_Guide_Linux-pdf

i.e. section 2.7 here:

Installation Guide Linux :: CUDA Toolkit Documentation

To get things working for you, you could start by using only the CUDA 8 deb method to install CUDA 8 (don’t install a driver separately). It should pull in the 361.93.02 driver, which is also compatible with the Tesla P100.

To get things working with 361.93.03, I would try the following process, but I have not personally walked through it, so I can’t vouch for it yet:

Using the CUDA 8 toolkit deb for Power indicated above, and assuming a clean Ubuntu 16.04 install:

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb
sudo apt-get update
sudo apt-get install cuda-toolkit

Then install the 361.93.03 driver with the runfile installer I linked above.

If you used methods other than the ones I surmised above to install the CUDA 8 toolkit or the 361.93.03 driver, disregard what I said here. Instead, if you want help, identify exactly where you acquired the 361.93.03 driver and the CUDA 8 toolkit for Power, and the exact method you used to install them.

Thanks so much for getting back to me.

I already tried the first approach where I didn’t install any drivers separately and only installed cuda. In that case even the nvidia-smi command didn’t work.

Now I want to try the second approach. I have been looking for the runfile for the 361.93.03 driver but could not find one.

JSYK, I got the 361.93.03 driver (deb package) from here :
http://www.nvidia.com/download/driverResults.aspx/109509/en-us

I got the one for Ubuntu; the one that you mentioned is for RHEL.

Any other suggestions?

Yes, my mistake. There is no published runfile, only a .deb package for Ubuntu and an .rpm for RHEL.

What did you use to install CUDA 8? Was it from the deb link I mentioned?

Yes, it is from the same deb link.

Thanks

This worked for me. I just tested it now. A colleague of mine set up an IBM S822LC for HPC system with 4 P100 GPUs in it, with a fresh load of Ubuntu 16.04.1 LTS:

# uname -a
Linux nchpc-g0 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:38:24 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

I then grabbed the aforementioned .deb file:

https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb

And did all the following as root:

  • rename the downloaded file from -deb to .deb
  • dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb
  • apt-get update
  • apt-get install cuda

This process visibly includes installing the nvidia-361 package, which is the NVIDIA driver. You can see clear evidence of the 361.93.02 driver kernel module being built and installed.
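The four steps listed above can also be written down as a small dry-run script. This is only a sketch: the file names are the ones from this thread, the helper names `install_steps` and `run_steps` are made up for illustration, and nothing is executed unless `dry_run=False` is passed (which would need root).

```python
import shlex
import subprocess

# File names from this thread; the download arrives with a "-deb" suffix.
DOWNLOADED = "cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el-deb"
RENAMED = "cuda-repo-ubuntu1604-8-0-local_8.0.44-1_ppc64el.deb"


def install_steps(downloaded=DOWNLOADED, renamed=RENAMED):
    """Return the four steps from the list above as argv lists, in order."""
    return [
        ["mv", downloaded, renamed],      # rename -deb to .deb
        ["dpkg", "-i", renamed],          # register the local repository
        ["apt-get", "update"],            # refresh the package index
        ["apt-get", "install", "cuda"],   # toolkit plus the packaged driver
    ]


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False (needs root)."""
    for argv in steps:
        print("# " + " ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_steps(install_steps())
```

By default this just prints the commands, which makes it easy to compare against what was actually typed on a machine that misbehaves.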

After that, again as root, without even a reboot, I did:

# nvidia-smi
Tue Nov 29 22:15:25 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0002:01:00.0     Off |                    0 |
| N/A   34C    P0    35W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0003:01:00.0     Off |                    0 |
| N/A   31C    P0    36W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 000A:01:00.0     Off |                    0 |
| N/A   33C    P0    35W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 000B:01:00.0     Off |                    0 |
| N/A   31C    P0    36W / 300W |      0MiB / 16280MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If you’re not able to duplicate that sequence, I’m not sure what the difference may be.
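When comparing two machines, it can help to pull the driver version out of the nvidia-smi header programmatically. The following is a minimal sketch, assuming the header keeps the "Driver Version: x.y.z" layout shown above; the function names are hypothetical.

```python
import re


def driver_version(nvidia_smi_output):
    """Return the 'Driver Version' string from nvidia-smi's header, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", nvidia_smi_output)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn '361.93.02' into (361, 93, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Header line copied from the nvidia-smi output above.
header = "| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |"
found = driver_version(header)
print(found)                                               # 361.93.02
print(version_tuple(found) < version_tuple("361.93.03"))   # True
```

On a live system the input string would come from running nvidia-smi itself, for example via `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.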

Thanks so much for getting back to me. I am trying a fresh install of Ubuntu here. Can you please confirm if you can run bandwidthTest or deviceQuery?

Thanks much.

Best
Shivani

So far I have done everything as root. Yes, I am able to run the CUDA sample codes successfully:

# /usr/local/cuda/samples/bin/ppc64le/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     29517.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     21339.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     449507.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
# /usr/local/cuda/samples/bin/ppc64le/linux/release/deviceQuery
/usr/local/cuda/samples/bin/ppc64le/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   3 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   10 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   11 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
#
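The "8.0 / 8.0" pair in the deviceQuery output above is the relationship that error 35 is about: the CUDA version supported by the driver must not be lower than the runtime's version. CUDA reports these versions as integers encoded as 1000 * major + 10 * minor, so 8.0 is 8000. The sketch below only illustrates that comparison; the real check happens inside the CUDA runtime, and the function names here are hypothetical.

```python
# Value of cudaErrorInsufficientDriver, as shown in the error quoted earlier.
CUDA_ERROR_INSUFFICIENT_DRIVER = 35


def decode(version):
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return version // 1000, (version % 1000) // 10


def driver_is_sufficient(driver_version, runtime_version):
    """The driver must support at least the CUDA version of the runtime."""
    return driver_version >= runtime_version


driver, runtime = 8000, 8000  # what deviceQuery reports as 8.0 / 8.0
print(decode(driver), decode(runtime), driver_is_sufficient(driver, runtime))
# (8, 0) (8, 0) True

# A driver that only supports CUDA 7.5 would fail with the 8.0 runtime:
print(driver_is_sufficient(7050, 8000))  # False
```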

Interesting to see the host<->device bandwidth for this system, at about twice the PCIe Gen3 x16 rate (typically 12 GB/sec).

In fact those numbers (~21GB/s and ~29GB/s) are a bit low. The link peak theoretical throughput is 40GB/s for each direction, and we typically see ~32GB/s there in either direction as a measurement. There is something not quite right with the box I was testing on – for example it may have incorporated some pre-production hardware, as all of this is pretty new.
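To put those numbers in proportion, here is a rough calculation against the 40GB/s theoretical peak. It is only a sketch: it takes the MB/s figures from the bandwidthTest run posted earlier and assumes 1 GB = 1000 MB for simplicity.

```python
PEAK_GB_PER_S = 40.0  # theoretical peak per direction, as stated above

# bandwidthTest figures from the run posted earlier, in MB/s.
measured_mb_per_s = {
    "host to device": 29517.4,
    "device to host": 21339.9,
}


def fraction_of_peak(mb_per_s, peak_gb_per_s=PEAK_GB_PER_S):
    """Convert MB/s to GB/s (assuming 1 GB = 1000 MB) and divide by the peak."""
    return (mb_per_s / 1000.0) / peak_gb_per_s


for direction, rate in measured_mb_per_s.items():
    print(f"{direction}: {rate / 1000.0:.1f} GB/s, {fraction_of_peak(rate):.0%} of peak")
# host to device: 29.5 GB/s, 74% of peak
# device to host: 21.3 GB/s, 53% of peak
```

By the same arithmetic, the ~32GB/s we typically measure corresponds to 80% of peak.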

My intent in posting that was to demonstrate that from a software install perspective, CUDA was functional. I did not mean it to be representative of current hardware behavior from a performance perspective.