Sum reduction working in Fermi, Kepler and Maxwell

I’m using Kepler-specific code to do a sum reduction over my arrays (from here: [Kepler architecture reductions][1]), and I wanted to generalize it. In other words, I want the code to run on every architecture.

I was studying the reduce6 code from the NVIDIA samples and trying to apply it, but I ran into some problems: the results weren’t the same.

I post both codes here:

    __forceinline__ __device__ float warpReduceSum(float val)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2) {
            val += __shfl_down(val, offset);
        }

        //for (int i = 1; i < warpSize; i *= 2) val += __shfl_xor(val, i);
        return val;
    }

    __forceinline__ __device__ float blockReduceSum(float val) {

        // --- Shared memory holds the per-warp partial sums. At most 1024 threads per block
        //     are assumed, so there are at most 32 warps per block.
        static __shared__ float shared[32];

        int lane = threadIdx.x % warpSize;   // Thread index within the warp
        int wid  = threadIdx.x / warpSize;   // Warp ID

        // --- Warp reduction. Afterwards, lane 0 of each warp holds that warp's partial sum.
        val = warpReduceSum(val);

        // --- Only lane 0 of each warp writes the warp result to shared memory
        if (lane == 0) {
            shared[wid] = val;   // Write reduced value to shared memory
        }

        // --- Wait for all warp reductions
        __syncthreads();

        // --- There are at most 1024 threads per block and at most 1024 blocks in the grid.
        //     The partial sum is read from shared memory only if the corresponding warp
        //     existed; otherwise the partial sum is set to zero.
        val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;

        // --- The first warp performs the final reduction of the warp partial sums.
        if (wid == 0) val = warpReduceSum(val);

        return val;
    }

    __global__ void deviceReduceKernel(float *in, float* out, long N)
    {
        float sum = 0.f;

        // --- Reduce multiple elements per thread.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {
            sum += in[i];
        }

        sum = blockReduceSum(sum);

        if (threadIdx.x == 0) {
            out[blockIdx.x] = sum;
        }
    }





    __host__ float deviceReduce(float *in, long N) {
      float *device_out;
      gpuErrchk(cudaMalloc(&device_out, sizeof(float)*1024));
      gpuErrchk(cudaMemset(device_out, 0, sizeof(float)*1024));

      int threads= 512;
      int blocks = min((int(N) + threads - 1) / threads, 1024);

      //bool isPowTwo = isPow2(N);
      deviceReduceKernel<<<blocks, threads>>>(in, device_out, N);
      gpuErrchk(cudaDeviceSynchronize());

      //isPowTwo = isPow2(blocks);
      deviceReduceKernel<<<1, 1024>>>(device_out, device_out, blocks);
      gpuErrchk(cudaDeviceSynchronize());

      float sum = 0.0;
      gpuErrchk(cudaMemcpy(&sum, device_out, sizeof(float), cudaMemcpyDeviceToHost));
      cudaFree(device_out);
      return sum;
    }

And the reduce6 kernel:

    template <unsigned int blockSize, bool nIsPow2>
    __global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
      {
       extern __shared__ float sdata[];

       // perform first level of reduction,
       // reading from global memory, writing to shared memory
       unsigned int tid = threadIdx.x;
       unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
       unsigned int gridSize = blockSize*2*gridDim.x;

       float mySum = 0.f;

       // we reduce multiple elements per thread.  The number is determined by the
       // number of active thread blocks (via gridDim).  More blocks will result
       // in a larger gridSize and therefore fewer elements per thread
       while (i < n)
       {
           mySum += g_idata[i];

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += g_idata[i+blockSize];

        i += gridSize;
       }

       // each thread puts its local sum into shared memory
       sdata[tid] = mySum;
       __syncthreads();


       // do reduction in shared mem
       if ((blockSize >= 512) && (tid < 256))
       {
           sdata[tid] = mySum = mySum + sdata[tid + 256];
       }

       __syncthreads();

       if ((blockSize >= 256) &&(tid < 128))
       {
               sdata[tid] = mySum = mySum + sdata[tid + 128];
       }

       __syncthreads();

       if ((blockSize >= 128) && (tid <  64))
       {
          sdata[tid] = mySum = mySum + sdata[tid +  64];
       }

       __syncthreads();

       #if (__CUDA_ARCH__ >= 300 )
        if ( tid < 32 )
        {
            // Fetch final intermediate sum from 2nd warp
            if (blockSize >=  64) mySum += sdata[tid + 32];
            // Reduce final warp using shuffle
            for (int offset = warpSize/2; offset > 0; offset /= 2) 
            {
                mySum += __shfl_down(mySum, offset);
            }
        }
        #else
        // fully unroll reduction within a single warp
        if ((blockSize >=  64) && (tid < 32))
        {
            sdata[tid] = mySum = mySum + sdata[tid + 32];
        }

        __syncthreads();

        if ((blockSize >=  32) && (tid < 16))
        {
            sdata[tid] = mySum = mySum + sdata[tid + 16];
        }

        __syncthreads();

        if ((blockSize >=  16) && (tid <  8))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  8];
        }

        __syncthreads();

        if ((blockSize >=   8) && (tid <  4))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  4];
        }

        __syncthreads();

        if ((blockSize >=   4) && (tid <  2))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  2];
        }

        __syncthreads();

        if ((blockSize >=   2) && ( tid <  1))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  1];
        }

       __syncthreads();
    #endif

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = mySum;
    }

The question is, how can I use a reduce6-style reduction in my code? My code uses a static shared memory array of 32 floats (one slot per warp) and doesn’t use the blockSize template parameter or the nIsPow2 flag the way reduce6 does. Another thing I realized is that reduce6 combines the block sums differently. I mean, in my code the final block sum looks like:

      deviceReduceKernel<<<1, 1024>>>(device_out, device_out, blocks);
      gpuErrchk(cudaDeviceSynchronize());

      float sum = 0.0;
      gpuErrchk(cudaMemcpy(&sum,device_out,sizeof(float),cudaMemcpyDeviceToHost));

While in the reduce6 sample the block sums are combined like this:

            // sum partial block sums on GPU
            int s=numBlocks;
            int kernel = whichKernel;

            while (s > cpuFinalThreshold)
            {
                int threads = 0, blocks = 0;
                getNumBlocksAndThreads(kernel, s, maxBlocks, maxThreads, blocks, threads);

                reduce<T>(s, threads, blocks, kernel, d_odata, d_odata);

                if (kernel < 3)
                {
                    s = (s + threads - 1) / threads;
                }
                else
                {
                    s = (s + (threads*2-1)) / (threads*2);
                }
            }

            if (s > 1)
            {
                // copy result from device to host
                checkCudaErrors(cudaMemcpy(h_odata, d_odata, s * sizeof(T), cudaMemcpyDeviceToHost));

                for (int i=0; i < s; i++)
                {
                    gpu_result += h_odata[i];
                }
             }

Also, what is cpuFinalThreshold?
[1]: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/

Without having looked at the details of your code, please keep in mind that floating-point addition is not associative, so adding the same data set in a different order will lead to different results.

A reasonable working hypothesis here is therefore that the different results are due to different order of summation, without one method of summation being generally any more correct than the other.
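
For example (a tiny host-side illustration; the values are hypothetical and chosen just to expose the rounding):

    #include <cstdio>

    int main()
    {
        // 1.0f is lost when added to 1.0e8f, because a float cannot
        // represent 100000001 exactly.
        float a = 1.0e8f, b = 1.0f, c = -1.0e8f;

        float left  = (a + b) + c;   // a + b rounds back to 1.0e8f, so this is 0.0f
        float right = (a + c) + b;   // a + c is exactly 0.0f, so this is 1.0f

        printf("(a + b) + c = %f\n", left);    // prints 0.000000
        printf("(a + c) + b = %f\n", right);   // prints 1.000000
        return 0;
    }

A parallel reduction changes the grouping of the additions relative to a serial loop, so small differences of this kind are expected.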

Yes, but the main thing is that I’m just trying to port the reduce6-style code (which works on any architecture) into my code, which I got from [1]: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/.

In your initial post you stated

which I understood to mean that the results from the two codes weren’t identical. Now you state:

which seems like a contradiction to the earlier statement. Please clarify what you meant.

Sorry, you are right. That is one problem, but it’s because I’m trying to port the reduce6 code and replace all of my earlier code with it. However, I want the last part of the code, the final block sum, to stay the same. I ran the code that way and, yes, the results weren’t the same, because obviously I wasn’t doing the same thing as the CUDA sample.

I meant something like this:

    template <bool nIsPow2>
    __global__ void deviceReduceKernel(float *g_idata, float *g_odata, unsigned int n)
      {
       extern __shared__ float sdata[];

       // perform first level of reduction,
       // reading from global memory, writing to shared memory
       unsigned int tid = threadIdx.x;
       unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
       unsigned int gridSize = blockSize*2*gridDim.x;

       float mySum = 0.f;

       // we reduce multiple elements per thread.  The number is determined by the
       // number of active thread blocks (via gridDim).  More blocks will result
       // in a larger gridSize and therefore fewer elements per thread
       while (i < n)
       {
           mySum += g_idata[i];

        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += g_idata[i+blockSize];

        i += gridSize;
       }

       // each thread puts its local sum into shared memory
       sdata[tid] = mySum;
       __syncthreads();

// do reduction in shared mem
       if ((blockSize >= 512) && (tid < 256))
       {
           sdata[tid] = mySum = mySum + sdata[tid + 256];
       }

       __syncthreads();

       if ((blockSize >= 256) &&(tid < 128))
       {
               sdata[tid] = mySum = mySum + sdata[tid + 128];
       }

       __syncthreads();

       if ((blockSize >= 128) && (tid <  64))
       {
          sdata[tid] = mySum = mySum + sdata[tid +  64];
       }

       __syncthreads();

       #if (__CUDA_ARCH__ >= 300 )
        if ( tid < 32 )
        {
            // Fetch final intermediate sum from 2nd warp
            if (blockSize >=  64) mySum += sdata[tid + 32];
            // Reduce final warp using shuffle
            for (int offset = warpSize/2; offset > 0; offset /= 2) 
            {
                mySum += __shfl_down(mySum, offset);
            }
        }
        #else
        // fully unroll reduction within a single warp
        if ((blockSize >=  64) && (tid < 32))
        {
            sdata[tid] = mySum = mySum + sdata[tid + 32];
        }

        __syncthreads();

        if ((blockSize >=  32) && (tid < 16))
        {
            sdata[tid] = mySum = mySum + sdata[tid + 16];
        }

        __syncthreads();

        if ((blockSize >=  16) && (tid <  8))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  8];
        }

        __syncthreads();

        if ((blockSize >=   8) && (tid <  4))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  4];
        }

        __syncthreads();

        if ((blockSize >=   4) && (tid <  2))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  2];
        }

        __syncthreads();

        if ((blockSize >=   2) && ( tid <  1))
        {
            sdata[tid] = mySum = mySum + sdata[tid +  1];
        }

       __syncthreads();
    #endif

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = mySum;
    }

    __host__ float deviceReduce(float *in, long N) {
      float *device_out;
      gpuErrchk(cudaMalloc(&device_out, sizeof(float)*1024));
      gpuErrchk(cudaMemset(device_out, 0, sizeof(float)*1024));

      int threads= 512;
      int blocks = min((int(N) + threads - 1) / threads, 1024);

      //bool isPowTwo = isPow2(N);
      deviceReduceKernel<<<blocks, threads>>>(in, device_out, N);
      gpuErrchk(cudaDeviceSynchronize());

      //isPowTwo = isPow2(blocks);
      deviceReduceKernel<<<1, 1024>>>(device_out, device_out, blocks);
      gpuErrchk(cudaDeviceSynchronize());

      float sum = 0.0;
      gpuErrchk(cudaMemcpy(&sum, device_out, sizeof(float), cudaMemcpyDeviceToHost));
      cudaFree(device_out);
      return sum;
    }

But then, wouldn’t blockSize just be blockDim.x here?
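
Or I guess I could keep blockSize as a compile-time template parameter, the way reduce6 does, and pick the instantiation on the host from the launch configuration. A rough, untested sketch of what I mean (the launchReduce wrapper and its parameter names are my own; the declaration stands for the kernel above with blockSize added back as a template parameter):

    // Untested sketch: blockSize stays a template parameter and the host
    // selects the matching instantiation for the chosen thread count.
    template <unsigned int blockSize, bool nIsPow2>
    __global__ void deviceReduceKernel(float *g_idata, float *g_odata, unsigned int n);

    void launchReduce(int blocks, int threads, float *d_in, float *d_out,
                      unsigned int n, bool isPow2)
    {
        // The kernel uses extern __shared__, so the dynamic shared memory
        // size has to be passed at launch time.
        size_t smem = threads * sizeof(float);

        switch (threads)   // the unrolled steps start at blockSize >= 512,
        {                  // so 512 is the largest block size handled here
            case 512:
                if (isPow2) deviceReduceKernel<512, true ><<<blocks, threads, smem>>>(d_in, d_out, n);
                else        deviceReduceKernel<512, false><<<blocks, threads, smem>>>(d_in, d_out, n);
                break;
            case 256:
                if (isPow2) deviceReduceKernel<256, true ><<<blocks, threads, smem>>>(d_in, d_out, n);
                else        deviceReduceKernel<256, false><<<blocks, threads, smem>>>(d_in, d_out, n);
                break;
            case 128:
                if (isPow2) deviceReduceKernel<128, true ><<<blocks, threads, smem>>>(d_in, d_out, n);
                else        deviceReduceKernel<128, false><<<blocks, threads, smem>>>(d_in, d_out, n);
                break;
            // ... more cases as needed
        }
    }

The second pass would then be something like launchReduce(1, 512, device_out, device_out, blocks, isPow2(blocks)) instead of the fixed <<<1, 1024>>> launch (assuming an isPow2() helper like the one in the sample).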

Doesn’t the sum reduction code to which I posted the link (Jimmy Pettersson’s code) work across multiple architectures?

I have yet to see an implementation that beats that approach, and that code was posted 3 years ago, I believe.

Maybe it does. But I didn’t understand what the variables NbXGroups, tail, and start_adr are, or why he uses the threads in a two-dimensional layout…

Ok, the way that code works is as follows:

Based on the size of the array of values to be summed, a determination is made regarding the number of strided values (elements of the input array) each thread in a block will load. In this case, for Kepler/Maxwell, 256 threads are used: 64 in the x dimension and 4 in the y dimension, so NbxGroups refers to the number of x thread blocks of size 64.

So each thread block in the first launch will cover some multiple of the value in:

    const int blockSize1 = 16384;

but that means that if the array size is not a multiple of that value, there will be ‘extra’ values which need to be considered, and that is where the ‘tail’ value comes in.

So basically each thread block launched ends up with a sub-sum of all the elements that block examined, which is saved in a global array (of size one value per thread block launched).

That is what the first launch accomplishes, and then there is the second launch.

The second launch will then examine the ‘remainder’ values and sum them within the block. Once that is done, each thread in that second launch will go through a fraction of the block sums saved during the first launch and sum/reduce them within that final ‘dynamic’ thread block.

When that is all synchronized, thread #0 saves the final sum to the first element of the block-sum array. That answer is copied back to the host and we’re done.

    if (threadIdx.x == 0)
        out[blockIdx.x] = smem[0][threadIdx.x]; // out[0] == ans
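
To make that first-launch pattern concrete, here is a rough sketch of the access scheme (this is not Jimmy Pettersson’s actual code; the kernel name partialBlockSums, THREADS_X/THREADS_Y, and BLOCK_CHUNK are my own names, with BLOCK_CHUNK playing the role of blockSize1 = 16384, and the separate ‘tail’ handling is left out):

    // Sketch only: each 64 x 4 thread block walks a fixed-size chunk of the
    // input with a coalesced stride and writes one partial sum per block.
    #define THREADS_X   64
    #define THREADS_Y   4
    #define BLOCK_CHUNK 16384   // plays the role of blockSize1 in the original code

    __global__ void partialBlockSums(const float *in, float *blockSums, int n)
    {
        __shared__ float smem[THREADS_Y][THREADS_X];

        int tx   = threadIdx.x, ty = threadIdx.y;
        int base = blockIdx.x * BLOCK_CHUNK;         // this block's chunk of the input
        int end  = min(base + BLOCK_CHUNK, n);

        // Each of the 256 threads strides through the chunk; consecutive tx
        // values touch consecutive addresses, so the loads are coalesced.
        float sum = 0.f;
        for (int idx = base + ty * THREADS_X + tx; idx < end; idx += THREADS_X * THREADS_Y)
            sum += in[idx];

        smem[ty][tx] = sum;
        __syncthreads();

        // Fold the y dimension, then reduce the remaining 64 values in x.
        if (ty == 0)
            for (int j = 1; j < THREADS_Y; j++) smem[0][tx] += smem[j][tx];
        __syncthreads();

        for (int s = THREADS_X / 2; s > 0; s >>= 1)
        {
            if (ty == 0 && tx < s) smem[0][tx] += smem[0][tx + s];
            __syncthreads();
        }

        // One sub-sum per block, as described above.
        if (tx == 0 && ty == 0) blockSums[blockIdx.x] = smem[0][0];
    }

In this simplified sketch it would be launched with dim3 block(THREADS_X, THREADS_Y) and gridDim.x = (n + BLOCK_CHUNK - 1) / BLOCK_CHUNK, and the last block simply stops at n, whereas the original code handles the leftover elements via the separate ‘tail’ computation; the second launch that reduces blockSums then proceeds as described above.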

The reason this implementation tends to work better than the ‘canonical’ implementation is that (in my experience) GPUs tend to like being ‘oversubscribed’ when it comes to memory operations, as long as the kernel maintains a good level of occupancy and the accesses are coalesced. So rather than trying to figure out the number of SMs on the GPU and the threads possible per SM, it just floods the device with small groups of work.

The code is commented rather well, so I think if you look through it you will be able to follow it.

But then, according to:

Based on the size of the array of values to be summed, a determination is made regarding the number of strided values (elements of the input array) each thread in a block will load. In this case, for Kepler/Maxwell, 256 threads are used: 64 in the x dimension and 4 in the y dimension, so NbxGroups refers to the number of x thread blocks of size 64.

the code would be static and wouldn’t work on Fermi?

It will still work on Fermi, I believe, but I think for that architecture the optimal number of threads is 64 rather than 256. You probably could change NbxGroups to 1 and the code would still work (though I have not tried it myself).

Why is blockSize1 = 16384?