Hello,
I have three nested for-loops:
for (i = 0; i < N; i++) {
    for (j = i + 1; j < N; j++) {
        for (k = j + 1; k < N; k++) {
            /* operations involving data: data[i], data[j], data[k] */
        }
    }
}
What is the best way to parallelize this in CUDA? For example, if I had two nested loops, I would map the iterations of the outer loop onto the threads, and each thread would run the inner loop either in one pass or in tiles, depending on the problem size and on whether the data is read from global memory or staged in shared memory.
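To make the two-loop case concrete, here is a minimal sketch of what I mean, written as plain C on the host rather than as a real kernel. It assumes the two loops have the same triangular shape as above (j starts at i + 1). The names pair_row and pair_all are made up for illustration, and the product data[i] * data[j] is only a placeholder for the real operation. In an actual kernel, the index i would come from something like blockIdx.x * blockDim.x + threadIdx.x instead of the host loop.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one hypothetical thread in the two-loop case: it owns the
   outer index i and runs the whole inner loop in one pass.
   The product is a placeholder for the real operation. */
double pair_row(const double *data, size_t n, size_t i)
{
    assert(i < n);
    double acc = 0.0;
    for (size_t j = i + 1; j < n; j++)
        acc += data[i] * data[j];
    return acc;
}

/* Host-side stand-in for the kernel launch: one "thread" per value of i,
   each writing its own result slot. */
void pair_all(const double *data, size_t n, double *out)
{
    for (size_t t = 0; t < n; t++)
        out[t] = pair_row(data, n, t);
}
```

Note that the thread with i = 0 does n - 1 inner iterations while the last one does none, so the work per thread is uneven.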
For this example, we could do the same, but that would launch only N threads (one per value of i), which is probably too few to keep the GPU busy. So we could use 2D blocks and have each thread (ti, tj) do the k-loop. What about 3D blocks and 3D grids?
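The 2D idea could look roughly like the sketch below, again as plain C on the host and not as a tested kernel. The names triple_k_loop and triple_all are hypothetical, and the triple product stands in for the real operation. In a kernel, ti and tj would presumably be derived from the x and y components of blockIdx, blockDim and threadIdx, and the per-thread results would go to an output array or a reduction instead of being summed on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one hypothetical thread (ti, tj) of a 2D launch.
   Threads with tj <= ti have nothing to do, which mirrors j = i + 1
   in the original loops. The triple product is a placeholder. */
double triple_k_loop(const double *data, size_t n, size_t ti, size_t tj)
{
    assert(ti < n && tj < n);
    if (tj <= ti)
        return 0.0;               /* outside the i < j triangle */
    double acc = 0.0;
    for (size_t k = tj + 1; k < n; k++)
        acc += data[ti] * data[tj] * data[k];
    return acc;
}

/* Host-side stand-in for an n x n launch: visits every (ti, tj)
   and adds up the per-thread results. */
double triple_all(const double *data, size_t n)
{
    double total = 0.0;
    for (size_t ti = 0; ti < n; ti++)
        for (size_t tj = 0; tj < n; tj++)
            total += triple_k_loop(data, n, ti, tj);
    return total;
}
```

With this mapping more than half of the N x N threads fall outside the triangle and exit immediately, and the remaining ones run k-loops of different lengths, so I am not sure it is the right approach.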
Any hints would be appreciated.