
[codebox]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include <cutil.h>
#include <cublas.h>
#include <mkl_cblas.h>

int main(int argc, char** argv)
{
    int nbIter = 10000;
    int m;
    int n = 128;

    cublasInit();  // the legacy cuBLAS API must be initialized before use

    for (int j = 0; j < 10; ++j) {
        m = 16 << j;
        // n = m;
        printf("-------------\nEvaluating %i iterations for a matrix %ix%i\n", nbIter, m, n);

        float time;
        float *mat, *x, *y;
        float *data = (float*) malloc(sizeof(float) * m * n);
        for (int i = 0; i < m * n; ++i)
            data[i] = ((float) i) / ((float)(m * n));

        unsigned int timer = 0;

        // CUDA test
        CUT_SAFE_CALL( cutCreateTimer(&timer) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &mat, m * n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &x, n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &y, m * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMemcpy(mat, data, m * n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(x, data, n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy(y, data, m * sizeof(float), cudaMemcpyHostToDevice) );

        CUT_SAFE_CALL( cutStartTimer(timer) );
        for (int i = 0; i < nbIter; ++i)
        {
            cublasSgemv('t', n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer(timer) );
        time = cutGetTimerValue(timer);

        // output results
        printf("CUDA Time: %f (ms)\n", time);

        CUDA_SAFE_CALL( cudaFree(mat) );
        CUDA_SAFE_CALL( cudaFree(x) );
        CUDA_SAFE_CALL( cudaFree(y) );
        CUT_SAFE_CALL( cutDeleteTimer(timer) );

        // CPU test
        mat = (float*) malloc(m * n * sizeof(float));
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(m * sizeof(float));
        memcpy(mat, data, m * n * sizeof(float));
        memcpy(x, data, n * sizeof(float));
        memcpy(y, data, m * sizeof(float));

        clock_t start = clock();
        for (int i = 0; i < nbIter; ++i)
        {
            cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        printf("CPU Time: %f (ms)\n", (clock() - start) * 1000 / (float) CLOCKS_PER_SEC);

        free(mat);
        free(x);
        free(y);
        free(data);
    }

    cublasShutdown();
    return 0;
}[/codebox]

The second dimension is fixed at 128 because that is usually what I have in my experiments. Here are the results (the CPU timer is far less accurate than the GPU one):

[codebox]-------------
Evaluating 10000 iterations for a matrix 16x128
CUDA Time: 214.681000 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 32x128
CUDA Time: 278.380005 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 64x128
CUDA Time: 278.065002 (ms)
CPU Time: 20.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.746002 (ms)
CPU Time: 30.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 256x128
CUDA Time: 278.177002 (ms)
CPU Time: 70.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 512x128
CUDA Time: 279.446991 (ms)
CPU Time: 140.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 1024x128
CUDA Time: 289.652008 (ms)
CPU Time: 310.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 2048x128
CUDA Time: 374.023987 (ms)
CPU Time: 630.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 4096x128
CUDA Time: 680.843018 (ms)
CPU Time: 1290.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 8192x128
CUDA Time: 1254.005005 (ms)
CPU Time: 2590.000244 (ms)[/codebox]

I also ran the same test for square matrices:

[codebox]-------------
Evaluating 10000 iterations for a matrix 16x16
CUDA Time: 89.642998 (ms)
CPU Time: 10.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 32x32
CUDA Time: 107.869003 (ms)
CPU Time: 0.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 64x64
CUDA Time: 164.585999 (ms)
CPU Time: 20.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 128x128
CUDA Time: 277.773987 (ms)
CPU Time: 30.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 256x256
CUDA Time: 506.329987 (ms)
CPU Time: 120.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 512x512
CUDA Time: 1154.552002 (ms)
CPU Time: 530.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 1024x1024
CUDA Time: 3484.691895 (ms)
CPU Time: 1960.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 2048x2048
CUDA Time: 7111.210938 (ms)
CPU Time: 17180.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 4096x4096
CUDA Time: 21080.605469 (ms)
CPU Time: 69410.000000 (ms)
-------------
Evaluating 10000 iterations for a matrix 8192x8192
CUDA Time: 80645.937500 (ms)
CPU Time: 308120.000000 (ms)[/codebox]

It seems that CUDA only starts to be interesting once your sizes are above a thousand (maybe even 2048). Do you get similar results? Is my way of benchmarking this valid? I know that the CPU timer is not accurate at all, but I don't need very precise measurements, just the order of magnitude (is it 10x slower or 3x faster?).

My hardware is an Intel Core i7 920 (4x2.67 GHz, Hyper-Threading, 8 MB L3), 8 GB of DDR3-1600, and 2x GTX 275 (though the above code obviously only uses one). I use CUDA 2.3 and Intel MKL 9.0.

I would appreciate it if you could run the above code on your setup and post your results here, along with your hardware specs. Thanks!


The CUDA call overhead is not that obvious to users. Anything that can be auto-detected, and that is done wrong so often, should be solved inside the library; at the very least, the cuBLAS documentation should contain a big hint about this fact. The ATLAS library (ATLAS = Automatically Tuned Linear Algebra Software) has been doing this for years, once during its installation process: just compare cpu_time(size) against gpu_time(size). This would make a lot of cuBLAS code much faster and easier to write!