Branch Divergence

My code repeats a calculation for a certain number of iterations and uses shared memory. On the first iteration, each thread copies its element from global memory to shared memory, and all threads then read an element from shared memory.

However, on each iteration after the first, I change one of the elements in shared memory (in a cyclic manner, but it is always done by the same thread; the last thread in the block in the example shown below), setting it to another value from global memory (global memory is only ever read, never written). The idea is to reuse the values already cached in shared memory instead of having each thread read from global memory again. So my present code has one thread (in the following example, the last thread in the block) updating its shared memory location with a value from global memory.

For example, here is a simplified version to demonstrate the idea.

for (iteration = 1; iteration < totalIterations; iteration++) {
    if (iteration == 1) {
        // First iteration: every thread loads its own element.
        sharedMem[threadIdx.x] = globalMem[globalIndex];
    } else if (threadIdx.x == blockDim.x - 1) {
        // Later iterations: only the last thread in the block replaces its
        // element. The index is not globalIndex in this case, as it could be
        // any value from global memory (here it advances cyclically; e.g.
        // with 4 threads in the block it would be 5, 6, 7, 8 and so on).
        sharedMem[threadIdx.x] = globalMem[...];
    }

    __syncthreads();

    // Compute something with the shared memory values.

    __syncthreads();  // so that no thread still reads sharedMem while it is
                      // updated in the next iteration
}

It is quite obvious that the code suffers from branch divergence (the CUDA Visual Profiler shows it as well). Is there a way to avoid the branch divergence while still using shared memory? Please let me know if anything is unclear and I can explain in further detail.

If you have some shared memory to spare, you could use all threads (or at least more of them) to read data for a few iterations ahead. That not only makes better use of the global memory bandwidth, but also avoids the global memory latency for the next few iterations.
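Something along these lines (a minimal sketch only; the kernel name, the PREFETCH batch size, the cyclic indexing, and the stand-in computation are all my own assumptions for illustration, and bounds checks are omitted):

__global__ void prefetchKernel(const float *globalMem, float *result,
                               int totalIterations)
{
    const int PREFETCH = 8;  // elements staged per batch (assumed <= blockDim.x)

    // Tile of blockDim.x elements plus PREFETCH staged elements; launch the
    // kernel with (blockDim.x + PREFETCH) * sizeof(float) dynamic shared memory.
    extern __shared__ float sharedMem[];
    float *staged = sharedMem + blockDim.x;

    int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nextIndex   = (blockIdx.x + 1) * blockDim.x;  // first element past the tile

    // First iteration: every thread fills its own slot.
    sharedMem[threadIdx.x] = globalMem[globalIndex];

    float acc = 0.0f;
    for (int iteration = 1; iteration < totalIterations; iteration++) {
        // Once every PREFETCH iterations, PREFETCH threads together stage the
        // next PREFETCH values in one batch, instead of a single thread
        // issuing one load per iteration.
        if (iteration >= 2 && (iteration - 2) % PREFETCH == 0
            && threadIdx.x < PREFETCH) {
            staged[threadIdx.x] = globalMem[nextIndex + (iteration - 2) + threadIdx.x];
        }
        __syncthreads();

        if (iteration >= 2 && threadIdx.x == blockDim.x - 1) {
            // The per-iteration update becomes a cheap shared-to-shared copy.
            sharedMem[threadIdx.x] = staged[(iteration - 2) % PREFETCH];
        }
        __syncthreads();

        // Stand-in for the real computation on the shared memory values.
        acc += sharedMem[(threadIdx.x + iteration) % blockDim.x];
    }
    result[globalIndex] = acc;
}

A side benefit: PREFETCH consecutive threads loading consecutive addresses coalesce into few memory transactions, whereas a single thread loading one value per iteration wastes most of each transaction.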

Thanks for your reply. I was also thinking along the same lines, of prefetching the data for a subsequent iteration under the else condition. I am just not really clear about the branch divergence: in my example, will all the other threads be stalled until the last thread in the block finishes the global memory access under the if condition? So the idea is to have the other threads, which are waiting anyway, do some useful work instead?

Yes, other threads keep running, until they eventually stall on a memory access as well. The number of threads running in parallel isn’t large enough to fully hide global memory latency, so some parallelism has to be extracted from within each thread as well. The SMs use some simple form of scoreboarding-based out-of-order execution for this, which lets threads continue to run until they encounter an instruction where the datum from memory is actually used.
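In code terms (a hypothetical snippet, just to illustrate the stall-on-use behavior):

__device__ float stallOnUse(const float *globalMem, int someIndex,
                            float a, float b, float c)
{
    float next = globalMem[someIndex];  // load is issued; the thread keeps running
    float x = a * b + c;                // independent work overlaps the load
    float y = x * x;                    // still no dependence on next
    return y + next;                    // first use of next: the thread stalls
                                        // here only if the data has not arrived
}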

By loading data for several iterations at once, the global memory latency is encountered less often. It is thus more likely that enough other threads are available (not waiting on memory reads) so that memory latency can be fully hidden.