__syncthreads throughput too low on Fermi!
The Programming Guide says that __syncthreads throughput is "16 operations per clock cycle for devices of compute capability 2.x" (Section 5.4.3).

Is that per SM? I'm getting at most 8 operations per clock cycle on a Fermi GT-430. This is the code:
[code]
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK_ERROR(call) do { \
    cudaError err = (call); \
    if (cudaSuccess != err) { \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } } while (0)

__shared__ long long int cl[2];  // timestamps: cl[0] = start, cl[1] = end

__global__ void no_op()
{
    cl[0] = clock64();    // every thread writes cl[0]; the last write wins
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    if (threadIdx.x == 0) {
        cl[1] = clock64();    // timestamp after 9 barriers
        for (int i = 0; i < 2; i++)
            printf("%lld\n", cl[i]);
    }
}

int main()
{
    no_op<<<1, 32*32>>>();    // one block of 1024 threads (32 warps)
    CUDA_CHECK_ERROR(cudaGetLastError());
    CUDA_CHECK_ERROR(cudaThreadSynchronize());
    return 0;
}
[/code]

It prints:
15667992
15669252
That is 1260 clock cycles for 9216 thread-level __syncthreads (9 barriers × 1024 threads), i.e. roughly 7-8 per cycle.

EDIT: with full occupancy it goes up to about 11/cycle.

#1
Posted 05/07/2012 05:15 AM   
This discrepancy is due to the latency of __syncthreads, which adds to the throughput cost.

Try running different numbers of warps, plot the timings, and look at the slope.

When I benchmarked __syncthreads on a GTX 480, I got a cost of 2 cycles per warp plus a constant overhead of ~40 cycles.
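For anyone who wants to reproduce the slope measurement, here is a minimal sketch of that kind of harness (my own code, not vvolkov's; the kernel name and the barrier count NBARRIERS are arbitrary). The slope of cycles-per-barrier versus warp count gives the per-warp cost, and the intercept gives the constant overhead:
[code]
#include <cuda.h>
#include <stdio.h>

#define NBARRIERS 64

// Time NBARRIERS back-to-back barriers and report the elapsed cycles.
__global__ void sync_bench(long long *out)
{
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < NBARRIERS; i++)
        __syncthreads();
    long long stop = clock64();
    if (threadIdx.x == 0)
        *out = stop - start;
}

int main()
{
    long long *d_out, h_out;
    cudaMalloc((void **)&d_out, sizeof(long long));   // error checking omitted for brevity
    for (int warps = 1; warps <= 32; warps++) {
        sync_bench<<<1, warps * 32>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(long long), cudaMemcpyDeviceToHost);
        // fit a line to (cycles per barrier) vs. warps:
        // slope = cycles per warp, intercept = constant overhead
        printf("%d warps: %.1f cycles per barrier\n",
               warps, (double)h_out / NBARRIERS);
    }
    cudaFree(d_out);
    return 0;
}
[/code]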

#2
Posted 05/07/2012 05:49 AM   
[quote name='vvolkov' date='06 May 2012 - 10:49 PM' timestamp='1336369776' post='1405124']
This discrepancy is due to the latency of __syncthreads, which adds to the throughput cost.

Try running different numbers of warps, plot the timings, and look at the slope.

When I benchmarked __syncthreads on a GTX 480, I got a cost of 2 cycles per warp plus a constant overhead of ~40 cycles.
[/quote]
Thank you. That way it comes out to exactly the 16/cycle claimed in the guide. For Kepler, it's 128/cycle; very neat!
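(To spell out the arithmetic from vvolkov's numbers: at 2 cycles per warp, a 32-thread warp clears a barrier every 2 cycles, i.e. 32 / 2 = 16 thread-level __syncthreads per cycle, exactly the guide's figure for compute capability 2.x.)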

#3
Posted 05/07/2012 06:03 AM   
I'm surprised to find cicc doesn't optimize away multiple back-to-back __syncthreads() calls like nvopencc did. Well, I guess there are more worthwhile optimizations than that.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#4
Posted 05/07/2012 10:16 AM   
[quote name='tera' date='07 May 2012 - 03:16 AM' timestamp='1336385775' post='1405172']
I'm surprised to find cicc doesn't optimize away multiple back-to-back __syncthreads() calls like nvopencc did. Well, I guess there are more worthwhile optimizations than that.
[/quote]

Me too; I had to use a more complicated contraption to measure it.

Maybe it is done to make it easier to use __syncthreads in conditional code. If you have different numbers of __syncthreads in the if- and else-branches, you might want to add a few redundant __syncthreads to make them equal, as in the sketch below.
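For illustration, a hypothetical kernel showing that padding trick (my example, not from the thread). Note the Programming Guide formally leaves __syncthreads under divergent control flow undefined; the trick relies on the barriers pairing up by count, which happens to work in practice but is not guaranteed:
[code]
// Launch e.g. as balanced_branches<<<1, 256>>>(d_data),
// with d_data holding at least blockDim.x floats.
__global__ void balanced_branches(float *data)
{
    int i = threadIdx.x;
    if (i < 64) {
        data[i] *= 2.0f;   // two-stage path: two barriers
        __syncthreads();
        data[i] += 1.0f;
        __syncthreads();
    } else {
        data[i] -= 1.0f;   // one-stage path: the second barrier is redundant,
        __syncthreads();   // added only so both branches execute the same
        __syncthreads();   // number of __syncthreads
    }
}
[/code]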

#5
Posted 05/07/2012 10:11 PM   
[quote name='tera' date='07 May 2012 - 03:16 AM' timestamp='1336385775' post='1405172']
I'm surprised to find cicc doesn't optimize away multiple back-to-back __syncthreads() calls like nvopencc did. Well, I guess there are more worthwhile optimizations than that.
[/quote]
Heh... frankly, the right thing to do is to optimize them away. It appears that the new compiler is a bit immature (not too surprising). I suspect it'll improve in the (near) future.

#6
Posted 05/08/2012 11:48 PM   