__syncthreads throughput too low on Fermi!

The Programming Guide says that __syncthreads throughput is “16 operations per clock cycle for devices of compute capability 2.x” (Section 5.4.3).

Is that per SM? I’m getting at most 8 operations per clock cycle on a Fermi GT-430. This is the code:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK_ERROR(call) do {                                          \
        cudaError_t err = (call);                                            \
        if (cudaSuccess != err) {                                            \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",    \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

// Start/end clocks, printed from the kernel.
__shared__ long long int cl[2];

__global__ void no_op()
{
    cl[0] = clock64();   // all threads write; the values are close enough
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    __syncthreads();
    if (threadIdx.x == 0) {
        cl[1] = clock64();
        for (int i = 0; i < 2; i++)
            printf("%lld\n", cl[i]);
    }
}

int main()
{
    no_op<<<1, 32 * 32>>>();   // one block of 1024 threads (32 warps)
    CUDA_CHECK_ERROR(cudaGetLastError());
    CUDA_CHECK_ERROR(cudaDeviceSynchronize());
    return 0;
}

It prints:

15667992
15669252

That is 1260 clock cycles for 9 barriers across 1024 threads, i.e. roughly 9,200 thread-level __syncthreads operations, or about 8 per cycle.

EDIT: with full occupancy it goes up to about 11/cycle.

The discrepancy is due to the latency of __syncthreads, which comes on top of the throughput cost, so a single measurement mixes the two.

Try running different numbers of warps, plot the timings, and look at the slope.

When I benchmarked __syncthreads on a GTX 480, I got a cost of 2 cycles per warp plus a constant overhead of ~40 cycles.
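
Here is a minimal sketch of such a sweep; the sync_cost kernel, the REPS constant, and the 1–32-warp loop are my own illustration, not the exact harness behind those numbers:

#include <stdio.h>

#define REPS 64                 // barriers per timed region (my own choice)

__global__ void sync_cost(long long int *cycles)
{
    long long int start = clock64();
    for (int i = 0; i < REPS; i++)
        __syncthreads();
    if (threadIdx.x == 0)
        *cycles = clock64() - start;
}

int main()
{
    long long int *d_cycles, h_cycles;
    cudaMalloc((void **)&d_cycles, sizeof(long long int));
    // One line per warp count; the slope of cycles-per-barrier vs. warps
    // is the per-warp throughput cost, the intercept the constant overhead.
    for (int warps = 1; warps <= 32; warps++) {
        sync_cost<<<1, warps * 32>>>(d_cycles);
        cudaMemcpy(&h_cycles, d_cycles, sizeof(long long int),
                   cudaMemcpyDeviceToHost);
        printf("%2d warps: %lld cycles per __syncthreads\n",
               warps, h_cycles / REPS);
    }
    cudaFree(d_cycles);
    return 0;
}

Fitting a line to cycles-per-barrier versus warp count separates the per-warp slope from the constant intercept.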

Thank you. That way it comes out to exactly the 16/cycle claimed in the guide. For Kepler it’s 128/cycle; very neat!

I’m surprised to find that cicc doesn’t optimize away multiple back-to-back __syncthreads() calls the way nvopencc did. Well, I guess there are more worthwhile optimizations than that.

Me too; I had to use a more complicated contraption to measure it.

Maybe it is done to make it easier to use __syncthreads in conditional code. If you have a different number of __syncthreads calls in the if and else branches, you might want to add a few redundant __syncthreads to make the counts equal, as in the sketch below.
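
A hypothetical kernel of my own, just to illustrate that balancing idea (note that the programming guide formally leaves __syncthreads in divergent code undefined, so this relies on the hardware counting arriving warps):

__global__ void balanced(const float *in, float *out)
{
    __shared__ float tile[64];

    if (threadIdx.x < 64) {
        tile[threadIdx.x] = in[threadIdx.x];
        __syncthreads();                 // barrier #1
        out[threadIdx.x] = tile[63 - threadIdx.x];
        __syncthreads();                 // barrier #2
    } else {
        __syncthreads();                 // matches barrier #1
        __syncthreads();                 // redundant; keeps the counts equal
    }
}

If the compiler merged the two back-to-back barriers in the else branch, the counts would no longer match.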

Heh… frankly, the right thing to do is to optimize them away. It appears that the new compiler is a bit immature (not too surprising). I suspect it’ll improve in the (near) future.