Dear All,
I have recently built a workstation with dual Titan X GPUs to speed up my work in Python and R.
Previously I used OpenBLAS alone, which worked really well. I have set everything up on Arch Linux,
and there seemed to be no problems during installation.
nvidia-settings --version
nvidia-settings: version 367.35 (builduser@rw) Fri Jul 15 21:07:27 CEST 2016
The NVIDIA X Server Settings tool.
Results from CUDA testing scripts (deviceQuery):
./deviceQuery Starting…
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12204 MBytes (12796297216 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX TITAN X (GPU1) : Yes
Peer access from GeForce GTX TITAN X (GPU1) -> GeForce GTX TITAN X (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = GeForce GTX TITAN X, Device1 = GeForce GTX TITAN X
Result = PASS
While everything looks fine, my tests with both Python and R fail to show any performance gains.
Not only that, performance is significantly worse than OpenBLAS alone on an 8-core CPU
(Intel Haswell). The shortest example I can think of:
test.R
A <- matrix(rnorm(10000 * 10000), ncol = 10000)
B <- A %*% A
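For reference, here is a slightly more controlled variant of test.R (just a sketch; the seed and the warm-up call are my additions). It keeps matrix generation out of the timing and pays any one-time initialization cost, such as CUDA context creation under NVBLAS, before the measured call:

# sketch: exclude matrix generation from the timing and do one warm-up
# multiply so one-time setup costs are not attributed to the measured call
set.seed(1)
A <- matrix(rnorm(10000 * 10000), ncol = 10000)
B <- A %*% A                      # warm-up: absorbs one-time setup costs
print(system.time(B <- A %*% A))  # timed multiply only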
nvblas.conf:
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TILE_DIM 2048
NVBLAS_AUTOPIN_MEM_ENABLED

Variations I have tried:
- NVBLAS_CPU_BLAS_LIB: also libopenblas.so.3 and libopenblas_haswellp-r0.2.18.so, with and without full paths.
- NVBLAS_GPU_LIST: also individual GPUs (even slower) and ALL0.
- NVBLAS_TILE_DIM: different values, with usually similar or worse performance.
- NVBLAS_AUTOPIN_MEM_ENABLED: tried with and without it.
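A variant of the config with NVBLAS's documented trace logging enabled (a sketch; the log file name is arbitrary) should also confirm whether the dgemm calls are really being intercepted and routed to the GPUs rather than falling through to the CPU fallback:

# same config as above, plus trace logging to record every intercepted call
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TILE_DIM 2048
NVBLAS_AUTOPIN_MEM_ENABLED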
With OpenBLAS alone:
time Rscript test.R
real 0m14.291s
user 1m49.483s
sys 0m7.053s
With NVBLAS preloaded:
time env LD_PRELOAD=libnvblas.so.7.5.18 R CMD BATCH test.R
I also tried libnvblas.so and libnvblas.so.7.5, giving the full path, and with/without env or export, etc.
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
real 1m40.445s
user 1m34.947s
sys 0m23.587s
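To rule out the preload silently failing, here is a quick sanity check from inside the R session (a sketch; Linux-specific, since it reads /proc) that lists which BLAS-related libraries are actually mapped into the process:

# the preload variable as R sees it
Sys.getenv("LD_PRELOAD")
# shared objects currently mapped into this process; both libnvblas and
# the OpenBLAS fallback should appear if interception is set up correctly
writeLines(grep("blas", readLines("/proc/self/maps"),
                ignore.case = TRUE, value = TRUE))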
nvidia-smi
Wed Aug 24 12:19:29 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 36%   76C    P2    87W / 250W |   1586MiB / 12203MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 29%   68C    P2    72W / 250W |    500MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       692    G   /usr/lib/xorg-server/Xorg                      421MiB |
|    0       750    G   /usr/bin/gnome-shell                           481MiB |
|    0      1287    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd   182MiB |
|    0     19237    C   sh                                             123MiB |
|    0     19340    C   /usr/lib64/R/bin/exec/R                        123MiB |
|    1     19237    C   sh                                             123MiB |
|    1     19340    C   /usr/lib64/R/bin/exec/R                        123MiB |
+-----------------------------------------------------------------------------+
A second capture, from another run while the script was executing, shows both GPUs at 100% utilization:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 36%   79C    P2   183W / 250W |   1911MiB / 12203MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 29%   72C    P2   163W / 250W |    824MiB / 12206MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       692    G   /usr/lib/xorg-server/Xorg                      422MiB |
|    0       750    G   /usr/bin/gnome-shell                           484MiB |
|    0      1287    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd   180MiB |
|    0      3639    C   sh                                             123MiB |
|    0      3740    C   /usr/lib64/R/bin/exec/R                        443MiB |
|    1      3639    C   sh                                             123MiB |
|    1      3740    C   /usr/lib64/R/bin/exec/R                        443MiB |
+-----------------------------------------------------------------------------+
Utilization and memory usage do increase, but only for brief moments
(even for longer scripts, utilization usually sits between 3 and 30%).
I have tried tens of different test scripts (PCA, SVD, etc.) in both
Python and R (including spectral clustering, for which the CPU alone is
~15 min faster than the dual Titan X setup), and I consistently get much
better performance with OpenBLAS alone. This seems really strange to me,
as I was expecting at least 2-3x faster execution from such powerful
GPUs (hence the purchase). Could someone please advise?
I have tried following the short guidelines available online. I am
tempted to set up Ubuntu instead of Arch to see whether older drivers
would help, but I need to be sure it is worth the time. There are no
errors, and the scripts clearly do access the GPUs, so I have absolutely
no idea why everything runs slower.
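In case it is useful for diagnosing this, a simple size sweep along these lines (pure R; the sizes are arbitrary) should show whether the gap narrows as the matrices grow and the per-call overhead gets amortized:

# time the same multiply at increasing sizes under each BLAS setup;
# if host-device transfers dominate, NVBLAS should only catch up at large n
for (n in c(2000, 4000, 8000, 16000)) {
  A <- matrix(rnorm(n * n), ncol = n)
  t <- system.time(A %*% A)[["elapsed"]]
  cat(sprintf("n = %5d: %7.2f s\n", n, t))
}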
Thanks in advance for any advice.