Hello,
I am simulating a dipolar system. The interaction goes as 1/r^3, but the problem is very similar to the N-body problem presented here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html We implemented the algorithm described there and it performs very well. I am wondering whether the performance on Kepler devices could be improved by using the warp shuffle functions. Below is a short description of the program and the code we use.
In the N-body problem, every particle interacts with all other particles, so in total there are N*N interactions (minus the self-interactions). Each thread calculates the force on one particle, so the kernel contains an extra loop over tiles. In each iteration of this loop, the positions of p particles are stored in shared memory and p x p interactions are calculated (figure 4 in the link above).
Here is the code:
__device__ float3
tile_calculation(float4 myPosition, float3 accel)
{
    int i;
    // Dynamically sized shared array: the brackets go after the name.
    extern __shared__ float4 shPosition[];
    for (i = 0; i < blockDim.x; i++) {
        accel = bodyBodyInteraction(myPosition, shPosition[i], accel);
    }
    return accel;
}
__global__ void
calculate_forces(void *devX, void *devA)
{
    // Dynamically sized shared array: the brackets go after the name.
    extern __shared__ float4 shPosition[];
    float4 *globalX = (float4 *)devX;
    float4 *globalA = (float4 *)devA;
    float4 myPosition;
    int i, tile;
    float3 acc = {0.0f, 0.0f, 0.0f};
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    myPosition = globalX[gtid];
    // N is the number of particles and p the tile size (p == blockDim.x);
    // both are defined elsewhere.
    for (i = 0, tile = 0; i < N; i += p, tile++) {
        int idx = tile * blockDim.x + threadIdx.x;
        // Each thread loads one particle of the current tile.
        shPosition[threadIdx.x] = globalX[idx];
        __syncthreads();
        acc = tile_calculation(myPosition, acc);
        // Wait until the whole block is done before overwriting the tile.
        __syncthreads();
    }
    // Save the result in global memory for the integration step.
    float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
    globalA[gtid] = acc4;
}
Would it be possible to improve this code by using the warp shuffle functions?
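To make the question concrete, here is a minimal sketch of what I have in mind, not something we have benchmarked. Each lane of a warp keeps one position in registers and the other lanes read it with a shuffle, so the tile is 32 particles wide and no shared memory or __syncthreads() is needed. The assumptions: the names calculate_forces_shfl and run_shuffle_check are made up for this post, the bodyBodyInteraction shown is only a softened placeholder in the style of the GPU Gems chapter (not our dipolar interaction), EPS2 is an arbitrary value, the particle count is a multiple of the block size of 64, and the warp size is 32. I use __shfl_sync from CUDA 9 and later; the Kepler-era toolkits spell it __shfl without the mask argument.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define EPS2 1.0e-2f  // softening term (assumed value, for illustration only)

// Placeholder interaction in the style of the GPU Gems chapter;
// the real dipolar interaction would replace this body.
__host__ __device__ inline float3
bodyBodyInteraction(float4 bi, float4 bj, float3 ai)
{
    float rx = bj.x - bi.x;
    float ry = bj.y - bi.y;
    float rz = bj.z - bi.z;
    float distSqr = rx * rx + ry * ry + rz * rz + EPS2;
    float invDist = 1.0f / sqrtf(distSqr);
    float s = bj.w * invDist * invDist * invDist;
    ai.x += rx * s;
    ai.y += ry * s;
    ai.z += rz * s;
    return ai;
}

// Hypothetical shuffle variant: the tile is one warp (32 particles) wide.
// Assumes n and blockDim.x are multiples of 32 and the grid covers exactly n threads.
__global__ void
calculate_forces_shfl(const float4 *globalX, float4 *globalA, int n)
{
    const unsigned fullMask = 0xffffffffu;  // all 32 lanes take part
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    float4 myPosition = globalX[gtid];
    float3 acc = {0.0f, 0.0f, 0.0f};
    for (int base = 0; base < n; base += 32) {
        // Each lane loads one particle of the current 32-wide tile.
        float4 mine = globalX[base + lane];
        for (int j = 0; j < 32; j++) {
            // Broadcast lane j's particle; float4 is shuffled per component.
            float4 other;
            other.x = __shfl_sync(fullMask, mine.x, j);
            other.y = __shfl_sync(fullMask, mine.y, j);
            other.z = __shfl_sync(fullMask, mine.z, j);
            other.w = __shfl_sync(fullMask, mine.w, j);
            acc = bodyBodyInteraction(myPosition, other, acc);
        }
    }
    globalA[gtid] = make_float4(acc.x, acc.y, acc.z, 0.0f);
}

// Runs the kernel on n pseudo-random particles (n a multiple of 64) and
// returns the worst relative deviation from a plain CPU double loop.
float run_shuffle_check(int n)
{
    std::vector<float4> x(n), a(n);
    unsigned seed = 12345u;
    for (int i = 0; i < n; i++) {
        float c[3];
        for (int k = 0; k < 3; k++) {
            seed = seed * 1664525u + 1013904223u;      // simple LCG
            c[k] = (seed >> 8) * (1.0f / 16777216.0f); // value in [0, 1)
        }
        x[i] = make_float4(c[0], c[1], c[2], 1.0f);
    }
    float4 *dX, *dA;
    cudaMalloc(&dX, n * sizeof(float4));
    cudaMalloc(&dA, n * sizeof(float4));
    cudaMemcpy(dX, x.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    calculate_forces_shfl<<<n / 64, 64>>>(dX, dA, n);
    cudaMemcpy(a.data(), dA, n * sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dA);
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float3 ref = {0.0f, 0.0f, 0.0f};
        for (int j = 0; j < n; j++)
            ref = bodyBodyInteraction(x[i], x[j], ref);
        worst = fmaxf(worst, fabsf(a[i].x - ref.x) / (1.0f + fabsf(ref.x)));
        worst = fmaxf(worst, fabsf(a[i].y - ref.y) / (1.0f + fabsf(ref.y)));
        worst = fmaxf(worst, fabsf(a[i].z - ref.z) / (1.0f + fabsf(ref.z)));
    }
    return worst;
}
```

Calling run_shuffle_check(256) from a host main and checking that the result is small should confirm that the shuffle version computes the same accelerations as the straightforward loop. What I cannot tell is whether four shuffles per interaction are actually cheaper than one float4 read from shared memory, which is really the core of my question.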
Cristian