Question about prefetching

In a reply to one of my other posts, it was suggested that I could use prefetching to fetch data for a subsequent iteration from global memory as a way of hiding memory latency.

However, I found that the resulting code is quite slow compared to accessing global memory directly without any shared memory.

Here is some code:

for (int iter = 0; iter < totIter; iter++) {

    int index2 = .... // omitting the weird logic for calculating index2 into shared memory

    if (iter % 2 == 0) {
        sharedMem1[threadIdx.x] = sharedMem2[threadIdx.x]; // make the shared arrays consistent after the previous iteration's update
        __syncthreads();
        func_arg2 = sharedMem2[index2];
    }
    else {
        sharedMem2[threadIdx.x] = sharedMem1[threadIdx.x];
        __syncthreads();
        func_arg2 = sharedMem1[index2];
    }

    if (threadIdx.x == blockDim.x - 1) {
        int gindex = .... // again omitting the weird logic for calculating gindex into global memory

        if (iter % 2 == 0)
            sharedMem1[...] = globalMem[gindex];
        else
            sharedMem2[...] = globalMem[gindex]; // the idea is to alternate between sharedMem1 and sharedMem2 to prefetch

        result += function(func_arg1, func_arg2); // arguments don't depend on sharedMem1 or sharedMem2 in the current iteration
    }
    else {
        result += function(func_arg1, func_arg2);
    }
}

Is there any real benefit in using two shared memory buffers to prefetch the data for a subsequent iteration?

And what can I do to avoid branch divergence, especially where the last thread in the block does the prefetching into shared memory?

Your code is too complicated for the out-of-order execution logic to resolve. I vaguely remember having read somewhere that the scoreboard has only a single entry per shared memory bank, so once you have issued a load into shared memory, any read from the same bank will stall until that load has retired.

Your logic for shuffling data between the shared memory buffers does not help with this either: you need to leave the data untouched between prefetch and use, or execution will stall when the shuffling is performed.

This means the best you can do is to fetch n iterations' worth of data once every n iterations. This will not remove the latency altogether, but on average it reduces it to 1/n of the original (relying on multithreading to perform the averaging); with n = 8, for example, each iteration on average incurs only an eighth of the full global-memory latency.

So I’d recommend a code structure like this (without any permuting of the indices, since I don’t know how that is meant to be done in your example):

__shared__ float sharedMem[BLOCKSIZE + N_PREFETCH]; // element type assumed float here

sharedMem[threadIdx.x] = globalMem[threadIdx.x];

for (int iter = 0; iter < totIter; iter++) {
    if (iter % N_PREFETCH == 0) {
        __syncthreads();
        if ((threadIdx.x < N_PREFETCH) && (BLOCKSIZE + iter + threadIdx.x < totIter)) {
            sharedMem[(BLOCKSIZE + iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)] = globalMem[BLOCKSIZE + iter + threadIdx.x];
        }
        __syncthreads();
    }

    func_arg = sharedMem[(iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)];
    result += function(func_arg);
}
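
In case a complete version is useful, here is a minimal sketch of how that loop might sit inside a full kernel. It relies on assumptions not stated in the thread: float data, a single 1D block of BLOCKSIZE threads, totIter a multiple of N_PREFETCH, and globalMem holding at least totIter + BLOCKSIZE elements (which makes the bounds check unnecessary); the names prefetch_kernel and results, and the plain accumulation standing in for function(), are hypothetical.

#define BLOCKSIZE  256
#define N_PREFETCH 8

__global__ void prefetch_kernel(const float *globalMem, float *results, int totIter)
{
    __shared__ float sharedMem[BLOCKSIZE + N_PREFETCH];

    // Initial fill: one element per thread.
    sharedMem[threadIdx.x] = globalMem[threadIdx.x];

    float result = 0.0f;

    for (int iter = 0; iter < totIter; iter++) {
        if (iter % N_PREFETCH == 0) {
            __syncthreads(); // previous batch fully consumed before it is overwritten
            if (threadIdx.x < N_PREFETCH) {
                sharedMem[(BLOCKSIZE + iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)] =
                    globalMem[BLOCKSIZE + iter + threadIdx.x];
            }
            __syncthreads(); // prefetched batch visible to all threads
        }

        float func_arg = sharedMem[(iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)];
        result += func_arg; // stand-in for function(func_arg)
    }

    results[threadIdx.x] = result;
}

The % (BLOCKSIZE + N_PREFETCH) wrap means each prefetched element lands in the slot of an element that has already been consumed, so the data stays untouched between prefetch and use, as recommended above.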