AtomicAdd in Shared memory is measured slower than in Global memory. Timing, Shared memory, Atomic o

skchoe · February 21, 2012, 10:09pm

Dear all,

I wrote 2 kernels to see how much atomicAdd() to shared memory is faster than global memory.
Kernel is simple: Just keep add from i=0 … ITER-1, under 16 threads in a block across 256/16 blocks.

The result I cannot understand is:
atomicAdd to Shared memory - 140ms
atomicAdd to Direct to Global memory- 90ms

It would be so appreciated if you drop a line.

SK.

Here’s simple codes:
#define WARP_WIDTH 16
#define W 256
#define ITER 1000000

///////////////AtomicAdd to Shared memory ‘shd’//////////////////
global void kernel_shdatm(int* in, int* out)
{
int j = threadIdx.x + blockDim.x * blockIdx.x;
shared int shd[WARP_WIDTH];
shd[threadIdx.x] = in[j];

int i;
for(i=0;i<ITER;i++)
  atomicAdd((int*)&(shd[threadIdx.x]), i );

out[j] =shd[threadIdx.x];
__syncthreads();
return;

}

///////////////AtomicAdd to global memory ‘out’//////////////////
global void kernel_glbatm(int* in, int* out)
{
int j = threadIdx.x + blockDim.x * blockIdx.x;
int i;
for(i=0;i<ITER;i++)
atomicAdd((int*)&(out[j]), i);

__syncthreads();
return;

}

////////////////////////////////////////////////////////////////////
// kernel call

// to shared memory->global memory copy
kernel_shdatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_in, g_out);

// to global memory directly.
kernel_glbatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_ing, g_outg);

The time is measured by surrounding each of them including memory alloc/copy with cudaEvent…().

Gregory_Diamos · February 21, 2012, 10:40pm

The short answer is don’t use shared memory atomics if you care about performance.

skchoe · February 22, 2012, 1:24am

Thanks Gregory for reply.

Yes, it’s a rule of thumb “Avoid Atomics!”

The original intention of the question is on the comparison between Atomics to Global with Shared memory.I wanted to see the overhead to copy from/to global memory and shared memory will be relatively smaller if the cost atomics is so large for each thread. In the code 10000 times atomicAdd() would be costlier enough than one time copy between global/shared memory.

Am I looking into this situation legitimately as designed?

Thanks,

SK.

I just wanted to see the theory(?) :

in my example.