When porting the machine learning framework I use to CUDA, I was disappointed to see that, for the type of operations I'm doing, CUDA is actually slower than CPU code. Most of my operations are matrix-vector multiplications, with sizes on the order of hundreds (e.g. 500x100). To find out at which size CUBLAS sgemv becomes faster than CBLAS sgemv, I wrote this small benchmark:
[codebox]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cutil.h>
#include <cublas.h>
#include <mkl_cblas.h>

int main(int argc, char** argv)
{
    int nbIter = 10000;
    int m;
    int n = 128;

    // the legacy CUBLAS API must be initialized before any call
    cublasInit();

    for (int j = 0; j < 10; ++j) {
        m = 16 << j;
        // n = m; // uncomment for the square-matrix test
        printf("-------------\nEvaluating %i iterations for a matrix %ix%i\n", nbIter, m, n);
        float time;
        float *mat, *x, *y;
        float *data = (float*) malloc(sizeof(float) * m * n);
        for (int i = 0; i < m * n; ++i)
            data[i] = ((float) i) / ((float) (m * n));
        unsigned int timer = 0;

        // GPU test: time nbIter sgemv calls, host<->device transfers excluded
        CUT_SAFE_CALL( cutCreateTimer(&timer) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &mat, m * n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &x, n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &y, m * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMemcpy(mat, data, m * n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(x, data, n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(y, data, m * sizeof(float), cudaMemcpyHostToDevice) );
        CUT_SAFE_CALL( cutStartTimer(timer) );
        for (int i = 0; i < nbIter; ++i)
        {
            // y = mat^T * x + y, with mat stored n x m column-major (lda = n)
            cublasSgemv('t', n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        // CUBLAS calls are asynchronous: synchronize before stopping the timer
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(timer) );
        time = cutGetTimerValue(timer);
        printf("CUDA Time: %f (ms)\n", time);
        CUDA_SAFE_CALL( cudaFree(mat) );
        CUDA_SAFE_CALL( cudaFree(x) );
        CUDA_SAFE_CALL( cudaFree(y) );
        CUT_SAFE_CALL( cutDeleteTimer(timer) );

        // CPU test: the same operation through CBLAS
        mat = (float*) malloc(m * n * sizeof(float));
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(m * sizeof(float));
        memcpy(mat, data, m * n * sizeof(float));
        memcpy(x, data, n * sizeof(float));
        memcpy(y, data, m * sizeof(float));
        clock_t start = clock();
        for (int i = 0; i < nbIter; ++i)
        {
            cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        printf("CPU Time: %f (ms)\n", (clock() - start) * 1000 / (float) CLOCKS_PER_SEC);
        free(mat);
        free(x);
        free(y);
        free(data);
    }

    cublasShutdown();
    return 0;
}[/codebox]
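One thing I should perhaps add: the old CUBLAS API does not return a status from cublasSgemv, so a failed launch inside the timed loop would go unnoticed. A minimal check after the loop (my addition, not included in the timings below) could look like this:
[codebox]// Sketch: legacy CUBLAS reports errors out-of-band via cublasGetError(),
// so polling it after the timed loop confirms the launches actually succeeded.
cublasStatus status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS)
    fprintf(stderr, "CUBLAS error after sgemv loop: %d\n", status);[/codebox]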
The second dimension is fixed at 128 because this is usually what I have in my experiments. Here are the results (the CPU timer is far less accurate than the GPU one):
[codebox]-------------
Evaluating 10000 iterations for a matrix 16x128
CUDA Time: 214.681000 (ms)
CPU Time: 10.000000 (ms)
Evaluating 10000 iterations for a matrix 32x128
CUDA Time: 278.380005 (ms)
CPU Time: 10.000000 (ms)
Evaluating 10000 iterations for a matrix 64x128
CUDA Time: 278.065002 (ms)
CPU Time: 20.000000 (ms)
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.746002 (ms)
CPU Time: 30.000000 (ms)
Evaluating 10000 iterations for a matrix 256x128
CUDA Time: 278.177002 (ms)
CPU Time: 70.000000 (ms)
Evaluating 10000 iterations for a matrix 512x128
CUDA Time: 279.446991 (ms)
CPU Time: 140.000000 (ms)
Evaluating 10000 iterations for a matrix 1024x128
CUDA Time: 289.652008 (ms)
CPU Time: 310.000000 (ms)
Evaluating 10000 iterations for a matrix 2048x128
CUDA Time: 374.023987 (ms)
CPU Time: 630.000000 (ms)
Evaluating 10000 iterations for a matrix 4096x128
CUDA Time: 680.843018 (ms)
CPU Time: 1290.000000 (ms)
Evaluating 10000 iterations for a matrix 8192x128
CUDA Time: 1254.005005 (ms)
CPU Time: 2590.000244 (ms)[/codebox]
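About the coarse 10 ms steps visible in the CPU column: clock() often has a resolution of around 10 ms, so at the small sizes a wall-clock timer with microsecond resolution would be more informative. A drop-in sketch (POSIX-specific; a Windows build would need something like QueryPerformanceCounter instead):
[codebox]#include <sys/time.h>

// Sketch: wall-clock milliseconds via gettimeofday (microsecond resolution),
// to replace the coarse clock()/CLOCKS_PER_SEC measurement on POSIX systems.
static double wallMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// usage:
//   double start = wallMs();
//   ... nbIter calls to cblas_sgemv ...
//   printf("CPU Time: %f (ms)\n", wallMs() - start);[/codebox]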
I also ran the same test for square matrices:
[codebox]-------------
Evaluating 10000 iterations for a matrix 16x16
CUDA Time: 89.642998 (ms)
CPU Time: 10.000000 (ms)
Evaluating 10000 iterations for a matrix 32x32
CUDA Time: 107.869003 (ms)
CPU Time: 0.000000 (ms)
Evaluating 10000 iterations for a matrix 64x64
CUDA Time: 164.585999 (ms)
CPU Time: 20.000000 (ms)
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.773987 (ms)
CPU Time: 30.000000 (ms)
Evaluating 10000 iterations for a matrix 256x256
CUDA Time: 506.329987 (ms)
CPU Time: 120.000000 (ms)
Evaluating 10000 iterations for a matrix 512x512
CUDA Time: 1154.552002 (ms)
CPU Time: 530.000000 (ms)
Evaluating 10000 iterations for a matrix 1024x1024
CUDA Time: 3484.691895 (ms)
CPU Time: 1960.000000 (ms)
Evaluating 10000 iterations for a matrix 2048x2048
CUDA Time: 7111.210938 (ms)
CPU Time: 17180.000000 (ms)
Evaluating 10000 iterations for a matrix 4096x4096
CUDA Time: 21080.605469 (ms)
CPU Time: 69410.000000 (ms)
Evaluating 10000 iterations for a matrix 8192x8192
CUDA Time: 80645.937500 (ms)
CPU Time: 308120.000000 (ms)[/codebox]
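To compare the two columns on an equal footing, it may help to convert the raw timings into an effective GFLOP/s rate, counting roughly 2*m*n floating-point operations per sgemv. A small helper for that (hypothetical, not part of the benchmark above):
[codebox]// Sketch: effective GFLOP/s for nbIter gemv calls on an m x n matrix,
// assuming ~2*m*n flops per matrix-vector product and a time in milliseconds.
static float gemvGflops(int m, int n, int nbIter, float timeMs)
{
    double flops = 2.0 * m * n * (double) nbIter;
    return (float) (flops / (timeMs * 1e-3) / 1e9);
}[/codebox]
For instance, the 8192x8192 run above works out to roughly 17 GFLOP/s on the GPU versus about 4 GFLOP/s on the CPU.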
It seems that CUDA starts to be interesting once your sizes are above a thousand (maybe even 2048). Do you have similar results? Is my way of benchmarking this valid? I know the CPU timer is not accurate at all, but I don't need very precise measurements, just the order of magnitude (is it 10x slower or 3x faster?).
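For what it's worth, the nearly flat ~28 µs per call at the small sizes (278 ms / 10000 iterations) looks like pure launch overhead, so I suppose that when the same matrix is applied to many vectors, packing the vectors into the columns of a matrix and issuing a single sgemm would amortize it. A minimal sketch of the idea (untested, my assumption):
[codebox]// Sketch (untested): replace k separate y_i = mat^T * x_i launches with a
// single Y = mat^T * X, where X packs the k input vectors as its columns.
// mat is n x m column-major with lda = n, as in the benchmark above;
// X is n x k and Y is m x k, all on the device. beta = 0 here instead of
// the benchmark's beta = 1, purely to keep the example simple.
void batchedGemv(const float* mat, const float* X, float* Y,
                 int m, int n, int k)
{
    cublasSgemm('t', 'n', m, k, n, 1.0f, mat, n, X, n, 0.0f, Y, m);
}[/codebox]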
My hardware is an Intel Core i7 920 (4x 2.67 GHz, Hyper-Threading, 8 MB L3), 8 GB of DDR3-1600, and 2x GTX 275 (though the above code obviously only uses one). I use CUDA 2.3 and Intel MKL 9.0.
I would appreciate it if you could run the above code on your setup and post your results here, along with your hardware specs. Thanks!