Hi,
I have a question about operations on a shared variable, specifically a timing issue. Below is a cut-down example that shows where the problem starts. I don't understand what is actually happening.
    __shared__ cuFloatComplex fj[BLOCKSIZE];
    int threadId = threadIdx.x + threadIdx.y*blockDim.x;
    int i = blockIdx.x;
    int j = blockIdx.y;
    fj[threadId] = make_cuFloatComplex(0,0);
    __syncthreads();
    if(threadId == 0)
    {
        cuFloatComplex sum = make_cuFloatComplex(0,0);
        // sum is accumulated from the shared array
        for(int index = 0; index < BLOCKSIZE; index++)
            sum += fj[index];
        f_in[j+i*gridDim.x] = sum;
    }
In the code above, I use the shared array fj to compute a sum and store it in f_in. This takes about 0.68 s. But if I instead write
    __shared__ cuFloatComplex fj[BLOCKSIZE];
    int threadId = threadIdx.x + threadIdx.y*blockDim.x;
    int i = blockIdx.x;
    int j = blockIdx.y;
    fj[threadId] = make_cuFloatComplex(0,0);
    __syncthreads();
    if(threadId == 0)
    {
        cuFloatComplex sum = make_cuFloatComplex(0,0);
        // sum is accumulated from a constant; fj is never read
        for(int index = 0; index < BLOCKSIZE; index++)
            sum += make_cuFloatComplex(0,0);
        f_in[j+i*gridDim.x] = sum;
    }
This version takes only 0.19 s. Both versions produce the same output; the only difference is that one reads the shared array and the other does not. Also, if I take the first example and assign a new value to “sum” before storing it in “f_in”, it again takes only 0.19 s, the same as the second example:
    __shared__ cuFloatComplex fj[BLOCKSIZE];
    int threadId = threadIdx.x + threadIdx.y*blockDim.x;
    int i = blockIdx.x;
    int j = blockIdx.y;
    fj[threadId] = make_cuFloatComplex(0,0);
    __syncthreads();
    if(threadId == 0)
    {
        cuFloatComplex sum = make_cuFloatComplex(0,0);
        for(int index = 0; index < BLOCKSIZE; index++)
            sum += fj[index];
        // sum is overwritten before it is stored
        sum = make_cuFloatComplex(0,0);
        f_in[j+i*gridDim.x] = sum;
    }
Is there any issue with my code? By the way, here is my implementation of the “+=” operator:
    __host__ __device__ void operator+=(cuFloatComplex &a, cuFloatComplex b)
    {
        a = cuCaddf(a,b);
    }
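In case it helps to reproduce this, here is a minimal self-contained sketch of the first variant with event-based timing. The kernel name sumShared, the helper timeSumSharedMs, BLOCKSIZE = 256, the 16x16 block, and the 64x64 grid are all placeholder assumptions for illustration, not values stated in this post, so the absolute timings will differ from the numbers above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>

#define BLOCKSIZE 256  // assumed value; must equal the threads per block

__host__ __device__ inline void operator+=(cuFloatComplex &a, cuFloatComplex b)
{
    a = cuCaddf(a, b);
}

// First variant from the post: thread 0 of each block sums the shared array.
__global__ void sumShared(cuFloatComplex *f_in)
{
    __shared__ cuFloatComplex fj[BLOCKSIZE];
    int threadId = threadIdx.x + threadIdx.y * blockDim.x;
    int i = blockIdx.x;
    int j = blockIdx.y;

    fj[threadId] = make_cuFloatComplex(0, 0);
    __syncthreads();

    if (threadId == 0)
    {
        cuFloatComplex sum = make_cuFloatComplex(0, 0);
        for (int index = 0; index < BLOCKSIZE; index++)
            sum += fj[index];
        f_in[j + i * gridDim.x] = sum;
    }
}

// Hypothetical helper: launches the kernel on an n x n grid, returns the
// elapsed kernel time in milliseconds, and copies the results into `out`.
// A square grid keeps the index j + i*gridDim.x inside the n*n buffer.
// Error checking is omitted to keep the sketch short.
float timeSumSharedMs(int n, std::vector<cuFloatComplex> &out)
{
    cuFloatComplex *d_f = nullptr;
    cudaMalloc(&d_f, n * n * sizeof(cuFloatComplex));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 grid(n, n);
    dim3 block(16, 16);  // 16 * 16 == BLOCKSIZE

    cudaEventRecord(start);
    sumShared<<<grid, block>>>(d_f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    out.resize(n * n);
    cudaMemcpy(out.data(), d_f, n * n * sizeof(cuFloatComplex),
               cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_f);
    return ms;
}

int main()
{
    std::vector<cuFloatComplex> out;
    float ms = timeSumSharedMs(64, out);
    printf("elapsed: %.3f ms, f_in[0] = (%g, %g)\n",
           ms, cuCrealf(out[0]), cuCimagf(out[0]));
    return 0;
}
```

Swapping the loop body for the second or third variant and comparing the printed times is how the three cases can be put side by side. The first launch may include one-time startup cost, so timing a second launch is probably a fairer comparison.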
Thanks in advance.