Branch Divergence
My code repeats a calculation for a certain number of iterations. I am using shared memory: in the first iteration, the data is copied from global memory to shared memory, and all threads then read an element from shared memory.

However, for each iteration beyond the first, I change one of the elements in shared memory (in a cyclic manner, but always by the same thread: the last thread in the block in the example below), setting it to another value read from global memory (global memory is only ever loaded, never stored). The idea is to reuse the values already cached in shared memory instead of having each thread read from global memory again. So my present code has one thread (the last thread in the block in the example) updating its shared memory location with a value from global memory.

For example, here is a simplified version to demonstrate the idea.

[code]for (int iteration = 1; iteration < totalIterations; iteration++) {

    if (iteration == 1) {
        // First iteration: every thread loads its own element.
        sharedMem[threadIdx.x] = globalMem[globalIndex];
    }
    else {
        if (threadIdx.x == blockDim.x - 1) {
            // The index here is not globalIndex; it advances cyclically through global
            // memory (e.g. with 4 threads per block it would be 5, 6, 7, 8, and so on).
            sharedMem[threadIdx.x] = globalMem[...];
        }
    }
    __syncthreads();

    // Compute something with the shared memory values.
}[/code]

It is quite obvious that the code suffers from branch divergence (the CUDA Visual Profiler shows it as well). Is there a way to avoid the branch divergence while still using shared memory? Please let me know if anything is unclear and I can explain in further detail.

#1
Posted 04/24/2012 08:57 PM   
If you have some shared memory to spare, you could use all threads (or at least some more) to read data for a few iterations ahead. That not only makes better use of the global memory bandwidth, but also avoids the global memory latency for the next few iterations.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/25/2012 12:40 AM   
[quote name='tera' date='25 April 2012 - 12:40 AM' timestamp='1335314427' post='1400589']
If you have some shared memory to spare, you could use all threads (or at least some more) to read data for a few iterations ahead. That not only makes better use of the global memory bandwidth, but also avoids the global memory latency for the next few iterations.
[/quote]

Thanks for your reply. I was also thinking along the same lines, prefetching the data for a subsequent iteration under the else condition. I am just not really clear about the branch divergence: in my example, will all the other threads be stalled until the last thread in the block finishes the global memory access under the if condition? So the idea is to have the other threads, which would be waiting anyway, do some useful work instead?

#3
Posted 04/25/2012 03:46 AM   
Yes, other threads keep running until they eventually stall on a memory access as well. The number of threads running in parallel isn't large enough to fully hide global memory latency, so some parallelism has to be extracted from within each thread as well. The SMs use a simple scoreboarding mechanism for this, which lets a thread keep executing until it encounters an instruction where the datum from memory is actually used.

By loading data for several iterations at once, the global memory latency is encountered less often. It is thus more likely that enough other threads are available (not waiting on memory reads) so that memory latency can be fully hidden.

#4
Posted 04/25/2012 11:12 PM   