3D Block and Grid

Hello,

I have three nested for-loops,

for (i = 0; i < N; i++) {
    for (j = i + 1; j < N; j++) {
        for (k = j + 1; k < N; k++) {
            // operations involving data: data[i], data[j], data[k]
        }
    }
}

What is the best way to parallelize this in CUDA? With two nested loops, I would distribute the outer loop across the threads and have each thread run the inner loop, either in one pass or in tiles, depending on the problem size and on whether the data is read from global memory or staged in shared memory.

For this example we could do the same, but that would probably not launch many threads, since only the i-loop (of length N) is parallelized. Alternatively, we could use 2D blocks and have each thread (ti, tj) run the k-loop. What about 3D blocks and 3D grids?
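To make the 2D idea concrete, here is a minimal sketch of what I have in mind. The names `triple_op`, `pair_kernel`, `partial` and `triple_sum_serial` are made up for illustration, the product `a * b * c` is only a placeholder for the real operations, and the 16x16 block size and N = 64 are arbitrary assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder for the real operations on data[i], data[j], data[k].
__host__ __device__ inline float triple_op(float a, float b, float c) {
    return a * b * c;
}

// One thread per (i, j) pair; each thread with i < j runs the whole k-loop.
__global__ void pair_kernel(const float* data, int n, float* partial) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;          // outside the N x N index space

    float acc = 0.0f;
    if (i < j) {                           // same bound as the serial j-loop
        for (int k = j + 1; k < n; ++k) {
            acc += triple_op(data[i], data[j], data[k]);
        }
    }
    partial[i * n + j] = acc;              // one slot per thread, no atomics needed
}

// Serial reference with the original loop bounds.
float triple_sum_serial(const float* data, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            for (int k = j + 1; k < n; ++k)
                sum += triple_op(data[i], data[j], data[k]);
    return sum;
}

int main() {
    const int n = 64;                      // assumed problem size
    std::vector<float> h_data(n, 1.0f);
    std::vector<float> h_partial(n * n, 0.0f);

    float *d_data = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_partial, n * n * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                    // 256 threads per block
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    pair_kernel<<<grid, block>>>(d_data, n, d_partial);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_partial.data(), d_partial, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    double gpu_sum = 0.0;
    for (float v : h_partial) gpu_sum += v;
    printf("GPU: %.1f  serial: %.1f\n", gpu_sum, triple_sum_serial(h_data.data(), n));

    cudaFree(d_data);
    cudaFree(d_partial);
    return 0;
}
```

In this sketch the threads with i >= j only write a zero, and the length of the k-loop shrinks as j grows, so the work per thread is uneven.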

Any hints would be appreciated.

Hello,

If the operations involving data[i], data[j], data[k] do not depend on the results for some other data[ip], data[jp], data[kp], you can still launch N*N*N threads. CUDA supports 3D blocks and 3D grids of blocks, so there are many ways to combine the block and thread indices into i, j, k. In the kernel, use an if statement so that threads with j <= i or k <= j do nothing; they return immediately, so the penalty is small, although only about one sixth of the launched threads do useful work. The best layout depends a lot on your operations.
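A minimal sketch of the N*N*N approach, under some assumptions: `combine3`, `triple_kernel` and `triple_sum_host` are hypothetical names, the product is just a stand-in for your real operations, and the single `atomicAdd` target is only there so the result can be checked against a serial loop.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the operations on data[i], data[j], data[k].
__host__ __device__ inline float combine3(float a, float b, float c) {
    return a * b * c;
}

// One thread per (i, j, k) candidate; only threads with i < j < k do work.
__global__ void triple_kernel(const float* data, int n, float* result) {
    // x is the fastest-varying thread dimension, so it is mapped to k.
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.z * blockDim.z + threadIdx.z;

    // i < j < k < n also guarantees that all three indices are in range.
    if (i < j && j < k && k < n) {
        atomicAdd(result, combine3(data[i], data[j], data[k]));
    }
}

// Serial reference using the original loop bounds.
float triple_sum_host(const float* data, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            for (int k = j + 1; k < n; ++k)
                sum += combine3(data[i], data[j], data[k]);
    return sum;
}

int main() {
    const int n = 64;                      // assumed problem size
    std::vector<float> h_data(n, 1.0f);
    float h_result = 0.0f;

    float *d_data = nullptr, *d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    dim3 block(8, 8, 8);                   // 512 threads per block
    dim3 grid((n + block.x - 1) / block.x,
              (n + block.y - 1) / block.y,
              (n + block.z - 1) / block.z);
    triple_kernel<<<grid, block>>>(d_data, n, d_result);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU: %.1f  host: %.1f\n", h_result, triple_sum_host(h_data.data(), n));

    cudaFree(d_data);
    cudaFree(d_result);
    return 0;
}
```

If your operations write independent outputs, you would not need the atomic at all; with a shared accumulator like this one, a per-block reduction would likely be faster. Keep in mind that the y and z grid dimensions are limited to 65535 blocks, so very large N needs a different decomposition.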