CUBLAS vs CBLAS sgemv: benchmarking matrix-vector operations on GPU and CPU
When porting the machine learning framework I use to CUDA, I was very disappointed to see that for the kind of operations I do, CUDA is actually slower than the CPU code. Most of my operations are matrix-vector multiplications, with sizes on the order of hundreds (e.g. 500x100). To find out from which size CUBLAS sgemv becomes faster than CBLAS sgemv, I wrote this small benchmark:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include <cutil.h>
#include <cublas.h>
#include <mkl_cblas.h>

int main(int argc, char** argv)
{
    int nbIter = 10000;
    int m;
    int n = 128;

    cublasInit();

    for (int j = 0; j < 10; ++j) {
        m = 16 << j;
        // n = m;
        printf("-------------\nEvaluating %i iterations for a matrix %ix%i\n", nbIter, m, n);

        float elapsed;
        float *mat, *x, *y;

        float *data = (float*) malloc(sizeof(float) * m * n);

        for (int i = 0; i < m * n; ++i)
            data[i] = ((float) i) / ((float)(m * n));

        unsigned int timer = 0;

        // GPU test
        CUT_SAFE_CALL( cutCreateTimer(&timer) );

        CUDA_SAFE_CALL( cudaMalloc((void**) &mat, m * n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &x, n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &y, m * sizeof(float)) );

        CUDA_SAFE_CALL( cudaMemcpy(mat, data, m * n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(x, data, n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(y, data, m * sizeof(float), cudaMemcpyHostToDevice) );

        CUT_SAFE_CALL( cutStartTimer(timer) );

        for (int i = 0; i < nbIter; ++i)
            cublasSgemv('t', n, m, 1, mat, n, x, 1, 1, y, 1);

        // cublasSgemv is asynchronous: wait for all calls to finish before stopping the timer
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(timer) );
        elapsed = cutGetTimerValue(timer);

        printf("CUDA Time: %f (ms)\n", elapsed);

        CUDA_SAFE_CALL( cudaFree(mat) );
        CUDA_SAFE_CALL( cudaFree(x) );
        CUDA_SAFE_CALL( cudaFree(y) );

        CUT_SAFE_CALL( cutDeleteTimer(timer) );

        // CPU test
        mat = (float*) malloc(m * n * sizeof(float));
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(m * sizeof(float));

        memcpy(mat, data, m * n * sizeof(float));
        memcpy(x, data, n * sizeof(float));
        memcpy(y, data, m * sizeof(float));

        clock_t start = clock();

        for (int i = 0; i < nbIter; ++i)
            cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1, mat, n, x, 1, 1, y, 1);

        printf("CPU Time: %f (ms)\n", (clock() - start) * 1000 / (float) CLOCKS_PER_SEC);

        free(mat);
        free(x);
        free(y);

        free(data);
    }

    cublasShutdown();
    return 0;
}[/codebox]

The second dimension is fixed because this is usually what I have in my experiments. Here are the results (note that the CPU timer is far less accurate than the GPU one):

[codebox]-------------
Evaluating 10000 iterations for a matrix 16x128
CUDA Time: 214.681000 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 32x128
CUDA Time: 278.380005 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 64x128
CUDA Time: 278.065002 (ms)
CPU Time: 20.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.746002 (ms)
CPU Time: 30.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 256x128
CUDA Time: 278.177002 (ms)
CPU Time: 70.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 512x128
CUDA Time: 279.446991 (ms)
CPU Time: 140.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 1024x128
CUDA Time: 289.652008 (ms)
CPU Time: 310.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 2048x128
CUDA Time: 374.023987 (ms)
CPU Time: 630.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 4096x128
CUDA Time: 680.843018 (ms)
CPU Time: 1290.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 8192x128
CUDA Time: 1254.005005 (ms)
CPU Time: 2590.000244 (ms)[/codebox]

I also ran the same test for square matrices:

[codebox]-------------
Evaluating 10000 iterations for a matrix 16x16
CUDA Time: 89.642998 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 32x32
CUDA Time: 107.869003 (ms)
CPU Time: 0.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 64x64
CUDA Time: 164.585999 (ms)
CPU Time: 20.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.773987 (ms)
CPU Time: 30.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 256x256
CUDA Time: 506.329987 (ms)
CPU Time: 120.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 512x512
CUDA Time: 1154.552002 (ms)
CPU Time: 530.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 1024x1024
CUDA Time: 3484.691895 (ms)
CPU Time: 1960.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 2048x2048
CUDA Time: 7111.210938 (ms)
CPU Time: 17180.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 4096x4096
CUDA Time: 21080.605469 (ms)
CPU Time: 69410.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 8192x8192
CUDA Time: 80645.937500 (ms)
CPU Time: 308120.000000 (ms)[/codebox]

It seems that CUDA starts to be interesting once your sizes go above a thousand (maybe even 2048). Do you get similar results? Is my way of benchmarking this valid? I know that the CPU timer is not accurate at all, but I don't need very precise measurements, just the order of magnitude (is it 10x slower or 3x faster?).

My hardware is an Intel Core i7 920 (4x2.67 GHz, Hyper-Threading, 8 MB L3), 8 GB of DDR3-1600, and 2x GTX 275 (but the above code obviously uses only one). I use CUDA 2.3 and Intel MKL 9.0.

I would appreciate it if you could run the above code on your setup and post your results here, along with your hardware specs. Thanks!

#1
Posted 02/11/2010 03:15 PM   
Without running your code, that sounds about right. Level 2 BLAS routines are really only faster at large sizes. The GPU does a lot better on level 3 BLAS: if you can restructure your code to do matrix-matrix rather than matrix-vector operations, you will get a big performance boost. In general, I have been able to work around the problem somewhat by fitting timing models and size metrics for the CUBLAS and BLAS routines I use. Call the timing model first to tell you which version of the code you should use. That way you use CUBLAS only when it makes sense to do so. For the smallest of the sizes you are benchmarking, even the host BLAS is overkill; you would be better off doing those with inlined SSE code on the host.

#2
Posted 02/11/2010 03:30 PM   
This doesn't seem too surprising... I would imagine that sgemv is going to launch one thread for each element of the output vector, and your average GPU doesn't get out of bed for a few hundred threads.

#3
Posted 02/11/2010 03:35 PM   
It would make sense if the CUBLAS library auto-detected small sizes and fell back to a CPU BLAS in cases where no win is expected. This would prevent hundreds of projects from wondering about this. The CUDA call overhead is not that obvious. What can be auto-detected, and is so often done wrong, should be solved in the library. At the very least, the cuBLAS documentation should contain a big hint about this fact!

#4
Posted 03/18/2014 11:04 PM   
It seems the results would differ on a case-by-case basis. The detection would need to consider the hardware configuration and the software logic. Sometimes the interim result needs to be reused to avoid the bus bottleneck. There is always a trade-off between performance and genericity. BTW, I am trying to implement the Libor Market Model and an artificial neural network in CUDA.

#5
Posted 03/19/2014 03:09 AM   
"Consider the hardware configuration": why not test, for each cuBLAS kernel, where the threshold is, as the ATLAS library has done for years (once, during the installation process)? ATLAS = Automatically Tuned Linear Algebra Software. Just compare cpu_time(size) vs gpu_time(size). This would make a lot of cuBLAS code much faster and easier!

#6
Posted 03/24/2014 01:32 PM   