Question about prefetching
In one of my other posts it was suggested that I could use prefetching to fetch data for a subsequent iteration from global memory, as a way of hiding the memory latency.
However, I found that the resulting code is quite slow compared to accessing global memory directly and not using shared memory at all.

Here is some code:

[code]
for (int iter = 0; iter < totIter; iter++) {

    int index2 = .... // Omitting the weird logic for calculating index2 for the shared memory

    if (iter % 2 == 0) {
        sharedMem1[threadIdx.x] = sharedMem2[threadIdx.x]; // making sure the shared memories are consistent after the previous iteration's update
        __syncthreads();
        func_arg2 = sharedMem2[index2];
    }
    else {
        sharedMem2[threadIdx.x] = sharedMem1[threadIdx.x];
        __syncthreads();
        func_arg2 = sharedMem1[index2];
    }

    if (threadIdx.x == blockDim.x - 1) {
        int gindex = .... // Again omitting the weird logic for calculating gindex for the global memory
        if (iter % 2 == 0)
            sharedMem1[...] = globalMem[gindex];
        else
            sharedMem2[...] = globalMem[gindex]; // the idea is to alternate between sharedMem1 and sharedMem2 to prefetch
        result += function(func_arg1, func_arg2); // arguments don't depend on sharedMem1 or sharedMem2 in the current iteration
    }
    else {
        result += function(func_arg1, func_arg2);
    }

} [/code]

Is there any real benefit here in using two shared memory buffers to prefetch the data for a subsequent iteration?

What can I do to avoid branch divergence, especially where the last thread in the block does the prefetching into shared memory?

#1
Posted 04/26/2012 08:35 PM   
Your code is too complicated for the out-of-order execution logic to resolve. I vaguely remember having seen somewhere that the scoreboard has only a single entry per shared memory bank, so once you have loaded data into shared memory, any read from the same bank will stall until the load operation has retired.

Your logic for shuffling data between the shared memory buffers does not help with this either: you need to leave the data untouched between prefetch and use, or execution will stall as soon as the shuffling is performed.
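
To illustrate what "untouched between prefetch and use" looks like, here is a minimal sketch that prefetches into a register instead. The contiguous indexing and the stand-in for function() are just assumptions on my part, since I don't know your actual access pattern:

[code]
// Hypothetical kernel: globalMem, out and totIter mirror the names in
// your example; globalMem is assumed to hold totIter * blockDim.x floats.
__global__ void registerPrefetch(const float *globalMem, float *out, int totIter)
{
    float result = 0.0f;
    float next = globalMem[threadIdx.x];    // prefetch for iteration 0

    for (int iter = 0; iter < totIter; iter++) {
        float cur = next;                   // consume last iteration's prefetch
        if (iter + 1 < totIter)
            // Issue the next load early and leave it untouched, so the
            // scoreboard can overlap it with the computation below.
            next = globalMem[(iter + 1) * blockDim.x + threadIdx.x];
        result += cur * cur;                // stand-in for function(func_arg1, func_arg2)
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = result;
}
[/code]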

This means the best you can do is to fetch [i]n[/i] iterations' worth of data on every [i]n[/i] iterations. This will not remove the latency altogether, but on average reduce it to 1/[i]n[/i] (relying on multithreading to perform the averaging).

So I'd recommend a code structure like this (without any permuting of the indices, since I don't know how that is meant to be done in your example):

[code]
__shared__ float sharedMem[BLOCKSIZE + N_PREFETCH]; // element type assumed float; a ring buffer one prefetch window larger than the block

sharedMem[threadIdx.x] = globalMem[threadIdx.x];    // initial fill

for (int iter = 0; iter < totIter; iter++) {
    if (iter % N_PREFETCH == 0) {
        __syncthreads();                            // condition is uniform across the block, so this is safe
        // the first N_PREFETCH threads refill the window for the next N_PREFETCH iterations
        if ((threadIdx.x < N_PREFETCH) && (BLOCKSIZE + iter + threadIdx.x < totIter)) {
            sharedMem[(BLOCKSIZE + iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)] = globalMem[BLOCKSIZE + iter + threadIdx.x];
        }
        __syncthreads();
    }

    func_arg = sharedMem[(iter + threadIdx.x) % (BLOCKSIZE + N_PREFETCH)];
    result += function(func_arg);
}
[/code]
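
And a hypothetical way to launch this, assuming the loop above sits in a kernel I'll call prefetchKernel (all names here are mine). Choosing N_PREFETCH as a multiple of the warp size keeps the divergent prefetch branch confined to whole warps, which also addresses your second question:

[code]
#define BLOCKSIZE  256
#define N_PREFETCH 64   // a multiple of 32, so the prefetch branch covers whole warps

// assumed signature, wrapping the loop above:
__global__ void prefetchKernel(const float *globalMem, float *out, int totIter);

// host side (numBlocks, d_globalMem, d_out set up elsewhere):
prefetchKernel<<<numBlocks, BLOCKSIZE>>>(d_globalMem, d_out, totIter);
cudaError_t err = cudaGetLastError();   // check the launch for errors
[/code]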

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/26/2012 09:29 PM   