Performance Issues with CUDA and Python/R

Dear All,

I have recently built a workstation with dual Titan X cards to speed up my work in Python and R.
Previously I used OpenBLAS alone, which worked really well. I have set everything up on Arch Linux
and there appeared to be no problems during installation.

nvidia-settings --version

nvidia-settings: version 367.35 (builduser@rw) Fri Jul 15 21:07:27 CEST 2016
The NVIDIA X Server Settings tool.

Results from CUDA testing scripts (deviceQuery):

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “GeForce GTX TITAN X”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “GeForce GTX TITAN X”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12204 MBytes (12796297216 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from GeForce GTX TITAN X (GPU0) → GeForce GTX TITAN X (GPU1) : Yes
Peer access from GeForce GTX TITAN X (GPU1) → GeForce GTX TITAN X (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = GeForce GTX TITAN X, Device1 = GeForce GTX TITAN X
Result = PASS

While everything looks fine, my tests with both Python and R fail to show any performance gains. Not only that,
the performance is significantly worse relative to OpenBLAS on an 8-core CPU (Intel Haswell). The shortest example I
can think of:

test.R
A <- matrix(rnorm(10000 * 10000), ncol = 10000)
B <- A %*% A

nvblas.conf
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so

For the CPU library I also tried libopenblas.so.3 and libopenblas_haswellp-r0.2.18.so, with and without full paths.

NVBLAS_GPU_LIST ALL

Tried individual GPUs (even slower) and ALL0.

NVBLAS_TILE_DIM 2048

Tried different values; performance was usually similar or worse.

NVBLAS_AUTOPIN_MEM_ENABLED

Tried with and without this option.
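
For reference, here is the whole file in one place, plus the optional logging keys from the nvBLAS documentation that can confirm whether BLAS calls are actually being intercepted (the logging lines are my addition for debugging, not something I have tuned):

nvblas.conf
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_TILE_DIM 2048
NVBLAS_AUTOPIN_MEM_ENABLED
# optional: write a trace of intercepted BLAS calls to a log file
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED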

time Rscript test.R
real 0m14.291s
user 1m49.483s
sys 0m7.053s

time env LD_PRELOAD=libnvblas.so.7.5.18 R CMD BATCH test.R

I also tried libnvblas.so, libnvblas.so.7.5, specifying the full path, with/without env or export, etc.

[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
real 1m40.445s
user 1m34.947s
sys 0m23.587s

nvidia-smi

+-----------------------------------------------------------------------------+
Wed Aug 24 12:19:29 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT… Off | 0000:01:00.0 On | N/A |
| 36% 76C P2 87W / 250W | 1586MiB / 12203MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT… Off | 0000:02:00.0 Off | N/A |
| 29% 68C P2 72W / 250W | 500MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 692 G /usr/lib/xorg-server/Xorg 421MiB |
| 0 750 G /usr/bin/gnome-shell 481MiB |
| 0 1287 G …s-passed-by-fd --v8-snapshot-passed-by-fd 182MiB |
| 0 19237 C sh 123MiB |
| 0 19340 C /usr/lib64/R/bin/exec/R 123MiB |
| 1 19237 C sh 123MiB |
| 1 19340 C /usr/lib64/R/bin/exec/R 123MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT… Off | 0000:01:00.0 On | N/A |
| 36% 79C P2 183W / 250W | 1911MiB / 12203MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT… Off | 0000:02:00.0 Off | N/A |
| 29% 72C P2 163W / 250W | 824MiB / 12206MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 692 G /usr/lib/xorg-server/Xorg 422MiB |
| 0 750 G /usr/bin/gnome-shell 484MiB |
| 0 1287 G …s-passed-by-fd --v8-snapshot-passed-by-fd 180MiB |
| 0 3639 C sh 123MiB |
| 0 3740 C /usr/lib64/R/bin/exec/R 443MiB |
| 1 3639 C sh 123MiB |
| 1 3740 C /usr/lib64/R/bin/exec/R 443MiB |
+-----------------------------------------------------------------------------+

The utilization and memory usage increase a bit, but only for brief moments
(even for longer scripts, utilization is usually between 3 and 30%). I have tried tens of
different test scripts (PCA, SVD, etc.) in both Python and R (including spectral
clustering, for which the CPU alone is ~15 min faster than the dual Titan X setup)
and I consistently get much better performance with OpenBLAS alone.

This seems really strange to me, as I was expecting at least 2-3 times faster
execution with such powerful GPUs (hence the purchase). Could someone please advise?
I tried following the short guidelines available online.

I am tempted to set up Ubuntu instead of Arch to see if having older drivers
would help, but I need to be sure it's worth the time. There are no errors and the scripts do
access the GPUs, so I have absolutely no idea why it runs slower.

Thanks in advance for any advice.

I don’t use either Python or R, but wonder whether the builds of these that you use actually include support for GPU acceleration, and if they do, whether that support has been properly turned on.

Hey njuffa,

Thanks for your comment. According to Nvidia it does:

Drop-in NVBLAS Library on Linux
Wrapper of cuBLAS
Includes Standard BLAS3 routines, such as SGEMM
Supports Multiple-GPUs
ZERO programming effort

Q: How to use it with R ?
A: Simple PRE-LOAD nvblas.so on Linux
Normally: R CMD BATCH .R
NVBLAS: env LD_PRELOAD=libnvblas.so R CMD BATCH .R

They also mention that this exact function can be accelerated:

nvBLAS can speedup this step

testXtrain <- as.matrix(testdata) %*% t(traindata)

FYI “R CMD BATCH” is slower than “Rscript”. Unfortunately libnvblas is still slower than libopenblas even when the script in both cases is called through Rscript (although the difference is smaller than in the example above).

Would really appreciate help with this. I spent a large chunk of my research budget on this hardware to speed up my work, so I'm really keen on making it work.

If R is using double precision data, Titan X is not the right card to use. Double precision performance of Titan X is pretty low.

Thanks mfatica, I was not aware of that. Tesla cards are way too expensive, and it seems the GTX range might not have much utility for machine learning beyond neural networks… Very disappointing.

Hi Fr45,

It's great that you were inspired by my presentation on R+CUDA and are trying GPU solutions.

Several suggestions:

  1. Use only 1 GPU for GEMM
    Since the matrices in your case are not that large, communication cost becomes significant.
    #NVBLAS_GPU_LIST ALL
    NVBLAS_GPU_LIST 0

  2. Switch to single-precision computations for TITAN X
    As we know, the default precision in R is double, but we can use single precision for performance.

    The 'gmatrix' and 'gpuR' packages provide single-precision computation in R.

    gmatrix

    A_f = gmatrix(rnorm(ORDER^2), ORDER, ORDER, type="single") #gmatrix
    B_f = gmatrix(rnorm(ORDER^2), ORDER, ORDER, type="single") #gmatrix
    C_f = gmatrix(0, ORDER, ORDER, type="single")
    gmm(A_f, B_f, C_f)

    # transfer the result back to the CPU

    C_h = h(C_f)

    gpuR

    vclA_f = vclMatrix(A, nrow = ORDER, type="float") #gpuR separate memory transfer
    vclB_f = vclMatrix(B, nrow = ORDER, type="float") #gpuR separate memory transfer
    vclC_f = vclA_f %*% vclB_f
    More details about gpuR: http://www.parallelr.com/r-gpu-programming-for-all-with-gpur/

    NOTE: The two packages currently differ slightly: gmatrix transfers the double-precision input to the GPU first and then converts it to single precision before computing, while gpuR converts the data type on the CPU first, which saves memory bandwidth; also, the vcl* class API is non-blocking, so it returns immediately.

  3. Profile to see where the bottleneck is
    Try: env LD_PRELOAD=libnvblas.so nvprof R
    then run the normal commands and exit,
    and look at the nvprof output to figure out where the bottleneck is (see also the note below).
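
    One hedged aside (not in my slides): R is usually launched through wrapper scripts, so the process nvprof attaches to may not be the one doing the CUDA work, in which case nvprof reports that no kernels were profiled. nvprof's --profile-child-processes option may help there, for example:

    env LD_PRELOAD=libnvblas.so nvprof --profile-child-processes Rscript test.R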

I will try to reproduce your case on our TITAN X and share the results with you.
Feel free to let me know if you have further questions.

BR,

–Patric

Hi Patric,

Many thanks for the great slides and your reply here.

I have tried a single GPU before (GPU 1, which is not driving the display), but using two GPUs is a couple of seconds faster for the example I used above.

I tried nvprof - I'm not sure whether it saves a more detailed profile somewhere, but this is what I got:

[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
==1689== NVPROF is profiling process 1689, command: Rscript test.R
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
==1689== Profiling application: Rscript test.R
==1689== Profiling result:
No kernels were profiled.

==1689== API calls:
No API activities were profiled.
==1689== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed

If your tests show that single vs. double precision is indeed behind the slower performance then it's quite a shame - I'm using a lot of existing libraries, so I would need to rewrite everything to make it work faster. It would be great if R or Python had a switch to use float32 by default.

Looking forward to your results!

BTW, regarding the following:

# nvBLAS can speed up this step
testXtrain <- as.matrix(testdata) %*% t(traindata)

To avoid the transpose on the CPU, which is very slow, you can change it to:

testXtrain <- tcrossprod(testdata, traindata)

Sorry, I don't cover this point in my slides, but for small workloads this change is also important.

Another example of removing t() is here:
http://www.parallelr.com/r-dnn-parallel-acceleration/
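
A quick sketch to check the equivalence (the matrix sizes here are arbitrary, just for illustration):

traindata <- matrix(rnorm(2000 * 100), ncol = 100)
testdata  <- matrix(rnorm(500 * 100), ncol = 100)
a <- testdata %*% t(traindata)        # explicit transpose: an extra copy is built on the CPU
b <- tcrossprod(testdata, traindata)  # same result, without materializing t(traindata)
all.equal(a, b)                       # TRUE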

The best double-precision performance per dollar is, by far, found with a Kepler-based Titan or Titan Black, not a Maxwell or Pascal Titan X.

A Titan Black is about $500 on eBay (used) and gives you about 1700 DP gigaflops. A Titan is about $350 and gives about 1400 DP gigaflops.

Your Maxwell Titan X offers only about 192 DP gigaflops, but has tremendous single-precision throughput of about 6000 gigaflops. The Pascal Titan X is about 300 DP / 10000 SP.

If you want the most powerful double precision option, a DGX-1 is a totally different class and gives you 37600 DP gigaflops for $129,000.

Got results on a Xeon E5645 @ 2.4GHz, 10 physical cores, with 2x TITAN X.

OpenBLAS:

A <- matrix(rnorm(10000 * 10000), ncol = 10000)
system.time(B <- A %*% A)
user system elapsed
211.331 8.642 19.379

nvBLAS with 1 TITAN X

A <- matrix(rnorm(10000 * 10000), ncol = 10000)
system.time(B <- A %*% A)
user system elapsed
10.506 1.887 12.437

proc.time()
user system elapsed
33.987 11.696 37.406
[NVBLAS] Using devices :0
[NVBLAS] Config parsed
[NVBLAS] Using devices :0
[NVBLAS] Config parsed

nvBLAS with 2 TITAN X

A <- matrix(rnorm(10000 * 10000), ncol = 10000)
system.time(B <- A %*% A)
user system elapsed
10.937 2.943 8.647

proc.time()
user system elapsed
36.662 14.405 35.808
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed

gmatrix float type 1 TITAN X

library(gmatrix)
Now using device 0 - “GeForce GTX TITAN X”
Starting cublas on device 0.
Creating new states on device 0.

Attaching package: gmatrix

The following object is masked from package:base:

svd

A <- matrix(rnorm(10000 * 10000), ncol = 10000)
A_f <- gmatrix(A, 10000, 10000, type="single")
B_f <- gmatrix(0, 10000, 10000, type="single")
system.time(gmm(A_f, A_f, B_f))
user system elapsed
0.283 0.040 0.324
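
As a rough back-of-the-envelope check (a sketch using the approximate peak figures quoted earlier in the thread):

# a 10000 x 10000 GEMM performs roughly 2 * N^3 floating-point operations
flops <- 2 * 10000^3
flops / 192e9     # ~10.4 s at the Maxwell Titan X's ~192 GFLOPS double-precision peak
flops / 6000e9    # ~0.33 s at its ~6000 GFLOPS single-precision peak

The second estimate is in line with the ~0.32 s elapsed time of the single-precision gmm() call above, and the first is in the same ballpark as the ~12 s single-GPU nvBLAS run, with the remainder going to transfers and framework overhead.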

Regarding "I'm using a lot of existing libraries, so I would need to rewrite everything to make it work faster. It would be great if R or Python had a switch to use float32 by default.":

R is very flexible, so as a workaround it's not hard to override '%*%' along the lines of the sketch below.

for example:
"%*%" <- function(A, B)
{
    if ( … ) {
        # sometimes use the original R GEMM
        C <- base::`%*%`(A, B)
    } else {
        # GPU float GEMM from gmatrix, gputools, gpuR or other GPU libs
        C <- gmatrix(0, nrow(A), ncol(B), type = "single")
        gmm(A, B, C)
    }
    return(C)
}
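
A slightly more concrete sketch of that idea, assuming the gmatrix functions shown earlier; the size threshold and the is.matrix() guard are arbitrary illustrations, not part of the original suggestion:

library(gmatrix)

"%*%" <- function(A, B) {
  # route only large plain double matrices to the GPU; everything else
  # falls back to R's built-in double-precision GEMM
  if (!is.matrix(A) || !is.matrix(B) || nrow(A) * ncol(B) < 4e6) {
    return(base::`%*%`(A, B))
  }
  A_g <- gmatrix(A, nrow(A), ncol(A), type = "single")  # copy to GPU and convert to float
  B_g <- gmatrix(B, nrow(B), ncol(B), type = "single")
  C_g <- gmatrix(0, nrow(A), ncol(B), type = "single")
  gmm(A_g, B_g, C_g)                                    # single-precision GEMM on the GPU
  h(C_g)                                                # copy the result back to the host
}

Note that this trades precision for speed, so it is only appropriate where float32 results are acceptable.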

Hi Patric,

Interesting, here are my results when measuring the time of just the %*% operation.

Rscript test.R
user system elapsed
100.837 8.204 6.997

Using 1 Titan X

env LD_PRELOAD=/opt/cuda/lib64/libnvblas.so Rscript test.R
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
user system elapsed
9.587 1.477 11.058
[NVBLAS] Using devices :1
[NVBLAS] Config parsed
[NVBLAS] Using devices :1
[NVBLAS] Config parsed

Using 2 Titan X

env LD_PRELOAD=/opt/cuda/lib64/libnvblas.so Rscript test.R
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
user system elapsed
9.743 1.116 5.798
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed

And the gmatrix example using a single GPU as it ignores the second one:

Rscript gpu.R
Loading required package: methods
Now using device 0 - “GeForce GTX TITAN X”
Starting cublas on device 0.
Creating new states on device 0.

Attaching package: ‘gmatrix’

The following object is masked from ‘package:base’:

svd

user system elapsed
0.300 0.047 0.348

So it seems that the main issue does indeed lie in single vs. double precision, as mfatica proposed.

Still, on your setup a single Titan X beats OpenBLAS, which is not the case on my machine. Also, the overall execution (not just a single multiplication) is slower with CUDA (even with 2 GPUs) than with OpenBLAS. Could you please give any advice on how to improve performance for cases where modifying the script would take too long, or would take longer than just running on OpenBLAS?

The preloading method is brilliant and makes it very easy to work with R and Python libraries. However, the longer the execution, the bigger the difference I see between OpenBLAS and CUDA (in favor of OpenBLAS, measuring whole-script execution with unix time). Perhaps it's because my OpenBLAS is newer and/or my CPU more powerful?

If there is a way to speed up execution with the preloading method that would be fantastic; otherwise I will need to start looking at rewriting scripts.

Again, thanks for your time and help!

Interesting observation:

time env LD_PRELOAD=/opt/cuda/lib64/libnvblas.so Rscript gpu.R
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
Loading required package: methods
Now using device 0 - “GeForce GTX TITAN X”
Starting cublas on device 0.
Creating new states on device 0.

Attaching package: ‘gmatrix’

The following object is masked from ‘package:base’:

svd

user system elapsed
0.290 0.053 0.346
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed
[NVBLAS] Using devices :0 1
[NVBLAS] Config parsed

real 0m16.374s
user 0m10.963s
sys 0m14.023s

time Rscript gpu.R
Loading required package: methods
Now using device 0 - “GeForce GTX TITAN X”
Starting cublas on device 0.
Creating new states on device 0.

Attaching package: ‘gmatrix’

The following object is masked from ‘package:base’:

svd

user system elapsed
0.330 0.026 0.361

real 0m9.356s
user 0m8.630s
sys 0m1.973s

The preloading and parsing of the config file seem to take quite a lot of time. Is there any way to have it always take over BLAS3 calculations without reading the config file so many times?

"So it seems that the main issue does indeed lie in single vs. double precision, as mfatica proposed."

As mfatica and SPWorley mentioned, the TITAN X is built for single precision, so it makes sense that 'float' is dramatically faster :)

" Could you please give any advice on how to improve the performance for cases when modifying the script would take too long or would take longer than just running on OpenBLAS?"

I totally understand your point that the preloading method is really great and convenient for R and Python with zero programming effort, and that modifying scripts is not so easy :(

I will think about it and run several experiments to see whether we can find a workaround to switch nvblas to the float type in R.

If you can share more code or a representative case, it would be very helpful for further optimization.

BTW, the '%*%' operator is already overloaded by gmatrix for gmatrix objects, so it performs the same as the gmm function:

system.time(B_f <- A_f %*% A_f)
user system elapsed
0.304 0.020 0.325

Thanks

Please note that Fr45 is running the benchmark on a Haswell CPU, which can perform 16 FLOPs/cycle using AVX2, while Patric is using Westmere Xeons, which can only perform up to 4 FLOPs/cycle. This seems to fully explain the difference.
On a side note, I would also time the matrix multiplication inside R, after having performed an untimed matrix multiplication, so that the setup time is eliminated.
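
For example (a sketch, reusing the 10000 x 10000 case from earlier in the thread):

A <- matrix(rnorm(10000 * 10000), ncol = 10000)
invisible(A %*% A)                # untimed warm-up: library loading and GPU setup happen here
print(system.time(B <- A %*% A))  # the timed run now measures mostly the GEMM itself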

Patric - it is fantastic to hear that you are keen on doing some extra work on this; it is greatly appreciated.

Having a simple LD_PRELOAD command that could offload the work would be of amazing value to anyone with a GTX card and matrix-matrix calculations.

As for sample scripts, so far I have downloaded several online/GitHub scripts where people compare several operations to benchmark OpenBLAS. In real-world scenarios, however, rewriting might be trickier, so I would also test longer functions.

I will try to get some scripts together and message you. Again, thanks for your input on the topic!