Is it possible to improve the performance of N-body interaction program with warp shuffle?

Hello,

I am doing some simulations of a dipolar system. The interaction is 1/r^3, but it is very similar to the N-body problem presented here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html. We implemented the algorithm described there and it is performing very well. I am wondering if it is possible to improve the performance on Kepler devices by using shuffle functions. Below is a short description of the program and the code we use.

In the N-body problem any given particle interacts with all the other particles, so in total there are N*N interactions (minus the self-interactions). Each thread calculates the forces acting on one particle, so the kernel contains an extra loop over tiles. In each iteration of this loop the positions of p particles are saved in shared memory and p x p interactions are calculated (figure 4 in the link above).

Here is the code:

__device__ float3
tile_calculation(float4 myPosition, float3 accel)
{
    int i;
    extern __shared__ float4 shPosition[];
    for (i = 0; i < blockDim.x; i++) {
        accel = bodyBodyInteraction(myPosition, shPosition[i], accel);
    }
    return accel;
}

__global__ void
calculate_forces(void *devX, void *devA)
{
    extern __shared__ float4 shPosition[];
    float4 *globalX = (float4 *)devX;
    float4 *globalA = (float4 *)devA;
    float4 myPosition;
    int i, tile;
    float3 acc = {0.0f, 0.0f, 0.0f};
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    myPosition = globalX[gtid];
    for (i = 0, tile = 0; i < N; i += p, tile++) {
        int idx = tile * blockDim.x + threadIdx.x;
        shPosition[threadIdx.x] = globalX[idx];
        __syncthreads();
        acc = tile_calculation(myPosition, acc);
        __syncthreads();
    }
    // Save the result in global memory for the integration step.
    float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
    globalA[gtid] = acc4;
}

Would it be possible to improve this code by using the warp-shuffle functions?

Cristian

Potentially, but don’t expect a huge improvement (if at all). Within the shared memory tile you can do another level of warp-sized tiling and then use shfl to rotate the positions within the warp. So after one shmem load and 31×3 shfl()s you’ve summed all 32×32 interactions between 32 myPositions and 32 shPositions.

Hope that is understandable, unfortunately I don’t have much time to explain.

Whether 3 shfls are faster than a float4 shmem load you’ll have to benchmark. cudaFuncSetSharedMemConfig(calculate_forces, cudaSharedMemBankSizeEightByte) might shift the balance to favour the shmem loads though.
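Roughly like this (an untested sketch, not a drop-in replacement: bodyBodyInteraction is the function from the GPU Gems code above, __shfl_sync is the CUDA 9+ spelling of the Kepler-era __shfl, and I assume equal masses so the .w component doesn’t need to be shuffled):

__device__ float3
warp_tile_calculation(float4 myPosition, float4 otherPosition, float3 accel)
{
    const unsigned mask = 0xffffffff;      // full warp participates
    const int lane = threadIdx.x & 31;
    for (int k = 0; k < 32; k++) {
        accel = bodyBodyInteraction(myPosition, otherPosition, accel);
        if (k < 31) {                      // 31 rotations of 3 shuffles each
            int src = (lane + 1) & 31;     // take the position held by the next lane
            otherPosition.x = __shfl_sync(mask, otherPosition.x, src);
            otherPosition.y = __shfl_sync(mask, otherPosition.y, src);
            otherPosition.z = __shfl_sync(mask, otherPosition.z, src);
        }
    }
    return accel;
}

Each lane first loads one position of the current 32-wide sub-tile into otherPosition (from shared memory or straight from global memory), then calls this in place of the inner loop of tile_calculation.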

Hello,

Thanks for the reply. For my problem I am using a tile size of 64. I could reduce the tile size to 32, which would eliminate the use of shared memory completely. However, this would reduce the maximum possible occupancy to a quarter, because there would be only 16 (blocks) * 32 (threads/block) = 512 active threads per MP.

Yes, that would be a bad move for occupancy, main memory bandwidth and register bank granularity reasons. Quite the opposite: why can’t you go to larger tile sizes?

I’d rather suggest using a 2-tier hierarchy of tiles: large tiles in shared memory (maybe as large as 512), and then smaller size-32 tiles for use with shuffle, if that turns out to improve speed. You could also try producing more than one output per thread (see the sketch at the end of this post).

N-body has been well optimized and AFAIR Nvidia has written a whitepaper detailing all optimization steps (although not necessarily updated for Kepler). I’m sure there must be a wealth of information available through a Google search?
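For the "more than one output per thread" part, a rough, untested sketch based on the GPU Gems kernel above (N and bodyBodyInteraction as before; the grid needs half as many threads since each thread now handles two bodies). The point is that every shared memory load is reused for two interactions:

__global__ void
calculate_forces_2x(void *devX, void *devA)
{
    extern __shared__ float4 shPosition[];
    float4 *globalX = (float4 *)devX;
    float4 *globalA = (float4 *)devA;

    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 myPos0 = globalX[2 * gtid];
    float4 myPos1 = globalX[2 * gtid + 1];
    float3 acc0 = {0.0f, 0.0f, 0.0f};
    float3 acc1 = {0.0f, 0.0f, 0.0f};

    for (int tile = 0; tile * blockDim.x < N; tile++) {
        shPosition[threadIdx.x] = globalX[tile * blockDim.x + threadIdx.x];
        __syncthreads();
        for (int j = 0; j < blockDim.x; j++) {
            float4 sh = shPosition[j];                     // one shared load ...
            acc0 = bodyBodyInteraction(myPos0, sh, acc0);  // ... two interactions
            acc1 = bodyBodyInteraction(myPos1, sh, acc1);
        }
        __syncthreads();
    }
    globalA[2 * gtid]     = make_float4(acc0.x, acc0.y, acc0.z, 0.0f);
    globalA[2 * gtid + 1] = make_float4(acc1.x, acc1.y, acc1.z, 0.0f);
}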

The code is taken from the old GPU Gems 3; I have not found any newer articles on this. I have implemented it for a 1/r^3 potential in 2D, but I exclude the self-interactions with an if statement (this should not be a problem, since the branch simply skips that single pair). I tried different tile sizes and used launch bounds because the register usage is too high; the optimal tile size was 64. Here is my own code:

typedef float dat;		/* datatype for scalar values */
typedef float2 dat2;		/* datatype for 2D values */

__device__ double2 dipolarInteraction(dat2 bi, dat2 bj, double2 ai, dat rcut, dat vpcutbd, dat lx, dat ly,
                                      const dat mi, const dat mj, const dat invlx, const dat invly,
                                      const dat delrxA, const dat delrxB)
{
    dat delrx;
    delrx = (mj == 1.0f ? delrxA : delrxB);

    dat deltax = bi.x - bj.x;
    dat deltay = bi.y - bj.y;

    /* minimum-image convention */
    deltax = (deltax >  lx * 0.5f ? deltax - lx : deltax);
    deltax = (deltax < -lx * 0.5f ? deltax + lx : deltax);
    deltay = (deltay >  ly * 0.5f ? deltay - ly : deltay);
    deltay = (deltay < -ly * 0.5f ? deltay + ly : deltay);

    dat rijsq = deltax * deltax + deltay * deltay;
    if (rijsq < rcut * rcut)
    {
        dat rij = rsqrtf(rijsq) * (1.0f / (rijsq * rijsq) - vpcutbd);
        rij *= mi * mj;
        ai.x += rij * deltax;
        ai.y += rij * deltay;
    }
    return ai;
}

__global__ void
__launch_bounds__(tpb, minBlocksPerMultiprocessor)
forces_calculation_tiles(curandState *devStates, dat2 *position, dat2 *dR, dat2 *devbin, const int Np,
                         const dat lx, const dat ly, const dat vpcutbd, const dat rcut,
                         const dat dt, const dat ggamma, const dat sqtwodt, const dat sqrtDB, const int devGO,
                         const dat invlx, const dat invly, const dat DB, const dat shear_rate,
                         const dat delrxA, const dat delrxB)
{
    int ip = threadIdx.x + blockIdx.x * tpb;

    dat gamma_i;
    dat sqrtD;
    double2 myacc;
    myacc.x = 0.0;
    myacc.y = 0.0;
    dat2 myposip = position[ip];
    dat2 mybin = devbin[ip];          /* this particle's bin/type, passed as mi below */
    curandState localState = devStates[ip];
    __shared__ dat2 shposition[tpb];
    __shared__ dat2 shneibcellbin[tpb];
//#pragma unroll
    for (int tile = 0; tile < bpg; tile++)
    {
        int jp = tile * tpb + threadIdx.x;
        shposition[threadIdx.x] = position[jp];
        shneibcellbin[threadIdx.x] = devbin[jp];
        __syncthreads();
//#pragma unroll
        for (int iloc = 0; iloc < tpb; iloc++)
        {
            jp = iloc + tpb * tile;
            if (ip != jp)             /* skip the self-interaction */
            {
                myacc = dipolarInteraction(myposip, shposition[iloc], myacc, rcut, vpcutbd, lx, ly,
                                           mybin.x, shneibcellbin[iloc].x, invlx, invly, delrxA, delrxB);
            }
        }
        __syncthreads();
    }
  
    dat2 rndnormal = curand_normal2(&localState);   /* draw two new Gaussian random numbers */

    gamma_i = 1.0f;                                 /* !!! homogeneous test case !!! */
    sqrtD   = sqrtDB;                               /* assumed: single diffusion constant in the homogeneous test case */

    dat deltax = 3.0f * ggamma * dt * myacc.x + sqrtD * sqtwodt * rndnormal.x;
    dat deltay = 3.0f * ggamma * dt * myacc.y + sqrtD * sqtwodt * rndnormal.y;
     
    myposip.x = myposip.x + deltax;
    myposip.y = myposip.y + deltay;

    /* standard PBCs */

    myposip.x = myposip.x - lx * round((double)myposip.x / (double)lx);   /* replace particle with image particle for convenience */
//  myposip.x = (myposip.x < -0.5 * lx ? myposip.x + lx : myposip.x);     /* safety inline */
//  myposip.x = (myposip.x >= 0.5 * lx ? myposip.x - lx : myposip.x);     /* safety inline */
    myposip.y = myposip.y - ly * round((double)myposip.y / (double)ly);
//  myposip.y = (myposip.y < -0.5 * ly ? myposip.y + ly : myposip.y);     /* safety inline */
//  myposip.y = (myposip.y >= 0.5 * ly ? myposip.y - ly : myposip.y);     /* safety inline */

    position[ip] = myposip;           /* update position on device */
    devStates[ip] = localState;
}

I will have to do extra checks. We have relatively small systems and we are using 10-30 streams to get more statistics and keep the whole GPU occupied.
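The host side looks roughly like this (a simplified sketch, not my actual code; NUM_REPLICAS, nsteps and the per-replica device pointers are placeholder names), with one stream per independent replica so that several small systems can run on the GPU concurrently:

cudaStream_t streams[NUM_REPLICAS];
for (int s = 0; s < NUM_REPLICAS; s++)
    cudaStreamCreate(&streams[s]);

for (int step = 0; step < nsteps; step++)
    for (int s = 0; s < NUM_REPLICAS; s++)
        forces_calculation_tiles<<<bpg, tpb, 0, streams[s]>>>(
            devStates[s], devPosition[s], devdR[s], devBin[s], Np,
            lx, ly, vpcutbd, rcut, dt, ggamma, sqtwodt, sqrtDB, devGO,
            invlx, invly, DB, shear_rate, delrxA, delrxB);

cudaDeviceSynchronize();

for (int s = 0; s < NUM_REPLICAS; s++)
    cudaStreamDestroy(streams[s]);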