Device Memory Bandwidth

I have two different kernels. The first just performs a copy. The second performs a copy plus a division. The bandwidth of the second kernel seems to be higher. How is that possible?

For the first kernel I got 58723 MB/s (the official figure is 57.6 GB/s). For the second kernel I got 80744 MB/s. Device: 8800 GT.

[codebox]

extern "C" __global__ void TestFunctionGPU1(float *eli1, float *eli2, float *out, uint size)
{
    uint tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = eli1[tid];
}

[/codebox]

[codebox]

extern "C" __global__ void TestFunctionGPU1(float *eli1, float *eli2, float *out, uint size)
{
    uint tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = eli1[tid] / eli2[tid];
}

[/codebox]

If your number is higher than the theoretical peak, then something is wrong with your measurement.

Did you synchronize (cudaThreadSynchronize()) after launching your kernel and before measuring the time?

I did the same in both cases.

[codebox]

DateTime start = DateTime.Now;
int numIterations = 1000;

for (int i = 0; i < numIterations; i++)
{
    cuda.Launch(function, (int)(size + BlockSize - 1) / BlockSize, 1);
}

cuda.SynchronizeContext();
float Time = (float)(DateTime.Now - start).TotalMilliseconds;
Console.WriteLine("Bandwidth: {0} Mb/s\n", size * sizeof(int) * Time / numIterations);

[/codebox]

This is a guess, since I’m not sure what cuda.Launch is, but: Are you launching with 1 thread per block? Did you mean to use BlockSize as the number of threads instead of 1?

BlockSize is the number of threads per block. It is defined as 256.

Yeah, I realize that. But the way you are calling cuda.Launch, it looks like you are passing 1 as the number of threads, when you should be passing BlockSize.

But I have a thread block of size 256 x 1. I am not using a 2D block here.
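
For reference, the expression (size + BlockSize - 1) / BlockSize in the host code above is a ceiling division: it gives the number of 256-thread blocks needed to cover size elements, which is why the kernels guard with tid < size. The thread does not settle what the last two arguments of cuda.Launch mean, so the following is only a sketch of the arithmetic; blocksFor is a hypothetical helper name, not part of any CUDA API.

```cpp
#include <cstdint>

// Hypothetical helper: number of blocks needed so that
// blocks * blockSize >= size (ceiling division).
// Assumes size + blockSize - 1 does not overflow 32 bits.
constexpr std::uint32_t blocksFor(std::uint32_t size, std::uint32_t blockSize)
{
    return (size + blockSize - 1) / blockSize;
}
```

With a block size of 256, 1000 elements need 4 blocks (1024 threads, 24 of them idle because of the tid < size guard), and 1025 elements need 5.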

Well, the problem was with the formula: size * sizeof(int) * Time / numIterations
It should be: 0.000001 * numIterations * size * sizeof(int) / Time (this gives GB/s, since Time is in milliseconds)

Now I get the following results: 60 GB/s for both kernels on a GTX 275.
Why is that so far from the theoretical figure (127 GB/s) or from what the bandwidth test shows (105000 MB/s)?

The GPU RAM is probably bottlenecked by the GPU processor ;)