For one of my other posts, it was suggested that I could prefetch data for a subsequent iteration from global memory as a way of hiding memory latency.
However, I found that this code is noticeably slower than just reading global memory directly and not using shared memory at all (that baseline is sketched after the code below).
Here is some code:
for (int iter=0; iter<totIter; iter++) {
    int index2 = .... // omitting the weird logic for calculating index2 for the shared memory
    if (iter%2 == 0) {
        sharedMem1[threadIdx.x] = sharedMem2[threadIdx.x]; // make the two buffers consistent after the previous iteration's prefetch
        __syncthreads();
        func_arg2 = sharedMem2[index2];
    }
    else {
        sharedMem2[threadIdx.x] = sharedMem1[threadIdx.x];
        __syncthreads();
        func_arg2 = sharedMem1[index2];
    }
    if (threadIdx.x == blockDim.x-1) {
        int gindex = .... // again omitting the weird logic for calculating gindex for the global memory
        if (iter%2 == 0)
            sharedMem1[...] = globalMem[gindex]; // alternate between sharedMem1 and sharedMem2 to prefetch
        else
            sharedMem2[...] = globalMem[gindex];
        result += function(func_arg1, func_arg2); // arguments don't depend on sharedMem1 or sharedMem2 for the current iteration
    }
    else {
        result += function(func_arg1, func_arg2);
    }
    __syncthreads(); // without this, the prefetch store can race with the next iteration's copy; all threads reach this point
}
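For reference, the faster baseline I compared against doesn't use shared memory at all and just reads global memory every iteration. A minimal sketch of what I mean (the index calculation is again omitted):

for (int iter = 0; iter < totIter; iter++) {
    int gindex = .... // same omitted index logic
    func_arg2 = globalMem[gindex]; // read straight from global memory each iteration
    result += function(func_arg1, func_arg2);
}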
Is there any real benefit here in using two shared memory buffers to prefetch the data for a subsequent iteration?
And what can I do to avoid branch divergence, especially where the last thread in the block does the prefetching into shared memory?
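For the divergence question, the restructuring I had in mind (a sketch based on my code above, not tested) hoists the accumulation out of the conditional so that only the single prefetch store is predicated:

if (threadIdx.x == blockDim.x-1) {
    int gindex = .... // same omitted index logic as above
    if (iter%2 == 0)
        sharedMem1[...] = globalMem[gindex];
    else
        sharedMem2[...] = globalMem[gindex];
}
result += function(func_arg1, func_arg2); // uniform across the block, no longer duplicated in both branches

Would that actually help, or is the cost of one predicated store negligible compared to the global memory read itself?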