My code repeats a calculation for a number of iterations. I am using shared memory, and in the first iteration each thread copies its element from global memory into shared memory; all threads then read an element from shared memory.
However, in every iteration after the first, I overwrite one of the shared-memory elements with another value from global memory. The replacement index advances cyclically, and the update is always performed by the same thread (the last thread in the block in the example below). Global memory is only ever read, never written. The idea is to reuse the values already staged in shared memory instead of having every thread read from global memory again. So in my present code, one thread (the last thread in the block in the example) refreshes its shared-memory slot with a value from global memory.
For example, here is a simplified version that demonstrates the idea:
for (iteration = 1; iteration < totalIterations; iteration++) {
    if (iteration == 1) {
        // First iteration: every thread loads its own element.
        sharedMem[threadIdx.x] = globalMem[globalIndex];
    }
    else if (threadIdx.x == blockDim.x - 1) {
        // The index is not globalIndex here; it could be any element of
        // global memory. In my case it advances cyclically, e.g. with
        // 4 threads in the block it could be 5, 6, 7, 8, and so on.
        sharedMem[threadIdx.x] = globalMem[...];
    }
    __syncthreads();
    // Compute something with the shared memory values.
    __syncthreads(); // all reads finish before the slot is overwritten again
}
It is quite obvious that the code suffers from branch divergence (the CUDA Visual Profiler confirms this). Is there a way to avoid the branch divergence while still using shared memory? Please let me know if anything is unclear and I can explain in further detail.
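In case it is useful, here is a fuller, compilable sketch of what I am doing. The kernel name, the output array, the concrete cyclic index computation, and the dummy compute step are all placeholders I made up for this post; only the overall pattern matches my real code:

#include <cuda_runtime.h>

// Sketch of the pattern described above. cyclicKernel, out, cyclicIndex and
// the accumulation step are placeholders for illustration only.
__global__ void cyclicKernel(const float *globalMem, float *out,
                             int totalIterations, int globalMemSize)
{
    extern __shared__ float sharedMem[]; // blockDim.x floats, set at launch

    int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int iteration = 1; iteration < totalIterations; iteration++) {
        if (iteration == 1) {
            // First iteration: every thread stages its own element.
            sharedMem[threadIdx.x] = globalMem[globalIndex];
        }
        else if (threadIdx.x == blockDim.x - 1) {
            // Only the last thread refreshes one slot; the index walks
            // cyclically through global memory (placeholder computation,
            // wrapped so it stays in bounds).
            int cyclicIndex = (blockIdx.x * blockDim.x + blockDim.x
                               + iteration - 2) % globalMemSize;
            sharedMem[threadIdx.x] = globalMem[cyclicIndex];
        }
        __syncthreads();

        // Dummy stand-in for the real computation: each thread reads one
        // shared-memory element.
        acc += sharedMem[(threadIdx.x + iteration) % blockDim.x];

        __syncthreads(); // all reads done before the slot is overwritten
    }

    out[globalIndex] = acc;
}

I launch it with the dynamic shared-memory size set to one float per thread, i.e. cyclicKernel<<<blocks, threads, threads * sizeof(float)>>>(...).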