Branch Divergence

I understand the concept of branch divergence in CUDA applications, but I’m testing an application to see the divergence in practice, and nvprof is giving me branch and divergent-branch counts that I can’t figure out the origin of.

So I have these two kernels:

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if (tid % 2 == 0){
        ia = 100.0f;
        ib = 50.0;
        ia = pow(ia,3) * ib;
    }
    else{
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib,3) + ia;
    }

    c[tid] = ia + ib;
}
__global__ void mathKernel2(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;

    if ((tid / warpSize) % 2 == 0){
        ia = 100.0f;
        ib = 50.0;
        ia = pow(ia,3) + ib;
    }
    else{
        ib = 200.0f;
        ia = 15.0f;
        ib = pow(ib,3) + ia;
    }

    c[tid] = ia + ib;
}

Both kernels are launched with 1 block of 64 threads on a GTX650.
nvprof tells me that the first kernel has 22 branches and 2 divergent branches, while the second has 12 branches and 0 divergent branches.
The 2 divergent branches are what I expected: there are 2 warps, and in the first kernel I split the threads within each warp by their thread index, whereas in the second kernel each warp takes a single path, so its branches are non-divergent. What I don’t understand is where the other 20 branches in kernel1 and the 12 branches in kernel2 come from.
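
Roughly, the launch and profiling look like this (a simplified sketch; the host code and the executable name are just placeholders, and error checking is omitted):

// Simplified host-side sketch of the test
int nElem = 64;
float *d_C;
cudaMalloc((void **)&d_C, nElem * sizeof(float));

mathKernel1<<<1, nElem>>>(d_C);   // 1 block, 64 threads = 2 warps
cudaDeviceSynchronize();
mathKernel2<<<1, nElem>>>(d_C);
cudaDeviceSynchronize();
cudaFree(d_C);

// The counts above come from the corresponding nvprof events, e.g.:
//   nvprof --events branch,divergent_branch ./simpleDivergence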

The behavior that nvprof is reporting on here is the behavior of the SASS, not of your C source code. For a kernel as simple as this you can probably connect the two, but in general it may be difficult to do.

Also, nvprof really only knows about “divergence” at the warp level. If two separate warps go down separate code paths, that does not count as “divergence”.

Your second case has no divergence because this operation:

if ((tid / warpSize) % 2 == 0){

causes all threads in a warp to go down only one of the two possible paths. Either all threads in the warp go down the if path, or all threads in the warp go down the else path. Therefore there is no measured “divergence”.
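
Spelled out for the 64-thread launch, the condition evaluates per warp like this (a worked example, not additional profiler output):

// warp 0: tids  0..31 -> tid / warpSize == 0 -> 0 % 2 == 0 -> all 32 threads take the if path
// warp 1: tids 32..63 -> tid / warpSize == 1 -> 1 % 2 == 1 -> all 32 threads take the else path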

So what I think of as a branch in the C code is not necessarily represented as a branch in the SASS code. In the C code I see two paths for each warp: path 1, which covers lines 3~11 plus line 18, and path 2, the else statement in lines 13~15.
So the part that looks purely sequential to me is actually represented in the SASS as many branches?

Is there a way to see the branch numbers of the C code?

There is no general rule for how branches at the HLL (high-level language) level are represented at the machine language level. This applies to CUDA just as it applies to any commonly used CPU.

The CUDA tool chain often applies well-known compiler optimizations such as if-conversion, which turns a branch body into predicated code, or it may employ select-type instructions, the machine equivalent of the C/C++ ternary operator, to avoid branches. Conversely, a branch with a compound condition at the HLL level may result in multiple branch instructions because of C/C++ short-circuit evaluation of boolean ANDs and ORs. So you really want to inspect the machine code, called SASS in the GPU context. You can use cuobjdump --dump-sass to extract the machine code from an executable.
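
For example, dumping the SASS for this test program could look like this (the file names are placeholders; a GTX650 is compute capability 3.0, hence the sm_30 flag):

nvcc -arch=sm_30 -o simpleDivergence simpleDivergence.cu
cuobjdump --dump-sass simpleDivergence > simpleDivergence.sass

And as a sketch of what if-conversion means in practice, a hypothetical kernel like the one below may compile to a compare plus a select or predicated instruction instead of an actual branch, which is one reason the SASS branch count need not match a count of C-level if/else statements:

// Hypothetical example: the ternary is a candidate for if-conversion,
// so the compiler may emit a select/predicated instruction and no branch at all.
__global__ void mathKernel3(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = (tid % 2 == 0) ? 100.0f : 200.0f;
}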

Branches on the GPU can only be divergent if threads within the same warp execute physically different code paths (i.e. different sequences of instruction addresses). All threads in a warp share a single program counter, and as long as the threads in a warp execute the same instruction sequence, including branches, there is no divergence.
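
Applied to the first kernel, tid % 2 == 0 splits every warp, which matches the 2 divergent branches reported:

// warp 0: even tids (0, 2, ..., 30) take the if path, odd tids (1, 3, ..., 31) take the else path
// warp 1: the same split for tids 32..63
// -> each warp executes both paths serially, and one divergent branch is counted per warp, i.e. 2 in total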