AtomicAdd in shared memory measured slower than in global memory (timing, shared memory, atomic operations)
Dear all,

I wrote two kernels to see how much faster atomicAdd() to shared memory is than to global memory.
The kernels are simple: each thread just keeps adding i for i = 0 .. ITER-1, with 16 threads per block across 256/16 blocks.

The result I cannot understand is:
atomicAdd to shared memory: 140 ms
atomicAdd directly to global memory: 90 ms

Any insight would be much appreciated.

SK.

Here's the code:
#define WARP_WIDTH 16
#define W 256
#define ITER 1000000

/////////////// atomicAdd to shared memory 'shd' ///////////////
__global__ void kernel_shdatm(int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    __shared__ int shd[WARP_WIDTH];
    shd[threadIdx.x] = in[j];

    for (int i = 0; i < ITER; i++)
        atomicAdd(&shd[threadIdx.x], i);  // each thread updates its own slot

    out[j] = shd[threadIdx.x];
}

/////////////// atomicAdd to global memory 'out' ///////////////
__global__ void kernel_glbatm(int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    for (int i = 0; i < ITER; i++)
        atomicAdd(&out[j], i);  // accumulate directly in global memory
}

////////////////////////////////////////////////////////////////////
// kernel call

// to shared memory->global memory copy
kernel_shdatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_in, g_out);

// to global memory directly.
kernel_glbatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_ing, g_outg);

Each launch is timed by surrounding it, including the memory allocation and copies, with cudaEventRecord()/cudaEventElapsedTime().
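(For what it's worth, including the allocation and host/device copies inside the timed region mixes transfer time into the atomic comparison. A minimal sketch that times only the kernel launch, reusing g_in/g_out from the snippet above, might look like:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Time only the kernel, not the allocations or copies.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel_shdatm<<<W/WARP_WIDTH, WARP_WIDTH>>>(g_in, g_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);        // block until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel_shdatm: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```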

#1
Posted 02/21/2012 10:09 PM   
The short answer is: don't use shared memory atomics if you care about performance.
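(A side note on the benchmark itself: in both kernels every thread only ever updates its own element, so no atomics are needed for correctness. If the goal is just to isolate the cost of atomicAdd(), a plain register accumulator gives a useful baseline. A sketch, using WARP_WIDTH/ITER as defined in the post above:)

```cuda
// Baseline: same arithmetic as the two kernels above, but no atomics.
// Each thread accumulates in a register and writes its slot once.
__global__ void kernel_reg(int* in, int* out)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    int acc = in[j];
    for (int i = 0; i < ITER; i++)
        acc += i;                    // ordinary add in a register

    out[j] = acc;                    // single global write per thread
}
```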

#2
Posted 02/21/2012 10:40 PM   
Thanks, Gregory, for the reply.

Yes, the rule of thumb is "avoid atomics!"

The original intent of the question was the comparison between atomics to global memory and atomics to shared memory. I wanted to see that the overhead of copying between global and shared memory becomes relatively small when the per-thread cost of the atomics is large. In the code, 10000 atomicAdd() calls should be far costlier than the single copy between global and shared memory.

Am I looking at this situation legitimately, as designed?

Thanks,

SK.

I just wanted to see the theory(?) from
http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/
play out in my example.
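(For reference, the pattern that tutorial is about pays off when many threads contend for the *same* location: accumulate within the block in shared memory, then issue one global atomic per block. A hedged sketch; the single-counter layout here is illustrative, not from the original code:)

```cuda
// Contended-counter pattern: all threads in a block add into one shared
// counter, and only thread 0 touches the global counter, once per block.
__global__ void kernel_blocksum(const int* in, int* total)
{
    __shared__ int block_sum;
    if (threadIdx.x == 0)
        block_sum = 0;
    __syncthreads();

    int j = threadIdx.x + blockDim.x * blockIdx.x;
    atomicAdd(&block_sum, in[j]);    // contention stays on-chip
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, block_sum); // one global atomic per block
}
```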

#3
Posted 02/22/2012 01:24 AM   