operation on shared array

Hi,

I have a question about an operation on a shared variable; it seems to be a timing issue. Below is a snippet showing where the problem starts, and I don’t understand what is actually happening.

__shared__ cuFloatComplex fj[BLOCKSIZE];

int threadId = threadIdx.x + threadIdx.y*blockDim.x;
int i = blockIdx.x;
int j = blockIdx.y;

fj[threadId] = make_cuFloatComplex(0,0);
__syncthreads();

if (threadId == 0)
{
    cuFloatComplex sum = make_cuFloatComplex(0,0);
    for (int index = 0; index < BLOCKSIZE; index++)
        sum += fj[index];
    f_in[j+i*gridDim.x] = sum;
}

In the code above, I use the shared array fj to compute a sum and store it in f_in. It takes about 0.68 s. But if I instead have something like:

__shared__ cuFloatComplex fj[BLOCKSIZE];

int threadId = threadIdx.x + threadIdx.y*blockDim.x;
int i = blockIdx.x;
int j = blockIdx.y;

fj[threadId] = make_cuFloatComplex(0,0);
__syncthreads();

if (threadId == 0)
{
    cuFloatComplex sum = make_cuFloatComplex(0,0);
    for (int index = 0; index < BLOCKSIZE; index++)
        sum += make_cuFloatComplex(0,0);
    f_in[j+i*gridDim.x] = sum;
}

This one takes only 0.19 s. The two versions produce the same output, yet one reads the shared array and the other does not. Also, in the first example, if I assign “sum” a new value just before storing it to “f_in”, it takes only 0.19 s as well, the same as the second example:

__shared__ cuFloatComplex fj[BLOCKSIZE];

int threadId = threadIdx.x + threadIdx.y*blockDim.x;
int i = blockIdx.x;
int j = blockIdx.y;

fj[threadId] = make_cuFloatComplex(0,0);
__syncthreads();

if (threadId == 0)
{
    cuFloatComplex sum = make_cuFloatComplex(0,0);
    for (int index = 0; index < BLOCKSIZE; index++)
        sum += fj[index];
    sum = make_cuFloatComplex(0,0);
    f_in[j+i*gridDim.x] = sum;
}

Is there any issue with my code? By the way, here is my implementation of the “+=” operator:

__host__ __device__ void operator+=(cuFloatComplex &a, cuFloatComplex b)
{
    a = cuCaddf(a, b);
}

Thanks in advance.

Hi,
I’d guessed it already, but I confirmed it by actually compiling the code you posted, and it shows that the compiler is very, very clever!
What happens is that, since you only accumulate zeros, the compiler detects that you are doing nothing useful (0 + 0 + … + 0 = 0) and simply optimises the work away.
When you read from shared memory, the fact that the data stays zero is not obvious to the compiler, whereas with “sum += make_cuFloatComplex(0,0)” it can prove it directly. The same goes for the third version: once you overwrite “sum” before the store, the whole loop is dead code.
So the faster versions are faster simply because they do almost nothing.
HTH