Understanding and adjusting Mark Harris's array reduction

I have code that is originally CPU-serial, and it involves summing all elements of an array for use as a normalization factor. Breaking the problem down results in 3 or 4 kernel functions, one of which is this sum, which can be solved by the array reduction approach.

In generic terms, the function that calls these kernels is defined as (error checking omitted for easier reading):

// Declare and allocate needed space
float *array_in, *array_out;

cudaMallocManaged(&array_in, array_len * sizeof(float));
cudaMallocManaged(&array_out, array_len * sizeof(float));

// Assume array_in is filled here

// Call the kernels
kernel_1 <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_out, array_len);
cudaDeviceSynchronize();
reduction <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_len);
cudaDeviceSynchronize();
kernel_2 <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_out, array_len);
cudaDeviceSynchronize();

// Do something else with data and then free the memory
cudaFree(array_in);
cudaFree(array_out);

I will need the array reduction result to call kernel_2(). I am now working on getting the reduction into my program, as it is already tested and proven code and there is no need to reinvent the wheel. The document I used as a base (copy/paste, if you will) is: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Slide 35 has the final (7th) kernel implementation, which I paste here just in case we need to reference to specific lines of the code:

==== SLIDE 35 of CUDA REDUCTION BY MARK HARRIS ====

template <unsigned int blockSize> __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;

sdata[tid] = 0;

while (i < n)
    {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
    }
__syncthreads();

if (blockSize >= 512)
    {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
    __syncthreads();
    }
if (blockSize >= 256)
    {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; }
    __syncthreads();
    }
if (blockSize >= 128)
    {
    if (tid <  64) { sdata[tid] += sdata[tid +  64]; }
    __syncthreads();
    }
if (tid < 32)
    {
    if (blockSize >=  64) sdata[tid] += sdata[tid + 32];
    if (blockSize >=  32) sdata[tid] += sdata[tid + 16];
    if (blockSize >=  16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=   8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=   4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=   2) sdata[tid] += sdata[tid +  1];
    }
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Then I looked at the CUDA sample code reduction.h/.cpp that comes with the installation, just to see the context needed for this code to run, and noticed that the sample code isn’t quite the same as the implementation in the referenced presentation, correct? The first difference I see is that, while Mark Harris aims for a generic implementation, the CUDA sample code assumes array sizes that are powers of 2. In my case I will be using arrays of around 2^32 or 2^33 elements, but not necessarily always powers of 2, so I will stick to Mark’s code, as it doesn’t imply this requirement.
In this case, can you enlighten me on a few aspects, as it is not really a drop-in function?

  • Where exactly is sdata coming from? What should its length be if we reduce an array of length N?
  • Is sdata storing intermediate sums for the current kernel run? If so, what is g_odata storing in the end?
  • In the template function definition, shouldn’t the template parameter blockSize be an argument along with the other ones?
  • What should the reduction kernel calls look like, considering that an array of 2^30 elements will require more calls than, say, 4096?

I appreciate any guidance you can provide.

sdata is an array created by a dynamic shared memory allocation (google CUDA dynamic shared memory, or just read the programming guide). It lives in shared memory. The size of the allocation in bytes (since it is dynamic, as opposed to static) is given in the kernel launch: it is the 3rd kernel launch config parameter. Study the reduction sample code that you’ve already mentioned, and you will see this. The length (i.e. the size of the allocation) has nothing to do with the array length N; it depends on the block size (the number of threads per block).
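
As a minimal sketch of how the declaration and the launch connect (hypothetical kernel, not from the sample code):

__global__ void smem_demo(float *data)
{
    // Unsized extern declaration: the actual byte count comes from the launch config.
    extern __shared__ float smem[];
    smem[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... work on smem ...
}

// Host side: the 3rd config parameter is a byte count sized by the block, not by N:
// smem_demo<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(d_data);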

They are both storing intermediate sums. Initially the data is stored in shared memory, as intermediate sums. The reduction eventually produces a single sum for all the data scooped up by the block (so, one partial sum for each block, eventually/finally). These final partial sums (considering the whole array, and the results of all the blocks) are stored in g_odata, which is a pointer to global memory. This is a good thing, because the partial sum(s) in shared memory do not last past the retiring of the block they are associated with.
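
Schematically (my illustration of the data flow, not code from the sample):

// first launch, with numBlocks blocks:
//   g_odata[0]           = partial sum of the data scooped up by block 0
//   g_odata[1]           = partial sum of the data scooped up by block 1
//   ...
//   g_odata[numBlocks-1] = partial sum of the data scooped up by the last block
// second launch, with 1 block: reduces g_odata[0..numBlocks-1] to the final sum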

Do you mean a parameter? A parameter is part of a function definition, an argument is what you pass when you are actually calling that function:
https://stackoverflow.com/questions/156767/whats-the-difference-between-an-argument-and-a-parameter
For templated functions, with respect to the definition/body of the function, the template parameter is already a parameter. There is no particular reason I can think of why it should or needs to be replicated in the ordinary function parameter list. Maybe you need to study templating further.

If you actually meant argument, you will find this argument in the template argument list of any templated function call, unless the argument is inferable from the ordinary function arguments.
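
A minimal sketch of the distinction, with hypothetical names:

// blockSize is a template parameter of the definition; it needs no twin in the
// ordinary parameter list, because it is already visible inside the body.
template <unsigned int blockSize>
__global__ void my_reduce(float *in, float *out, unsigned int n)
{
    __shared__ float s[blockSize];   // usable as a compile-time constant
    // ...
}

// At the call site, 128 is the template argument, supplied in angle brackets
// (it cannot be inferred here, since no ordinary argument mentions it):
// my_reduce<128><<<numBlocks, 128>>>(d_in, d_out, n);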

You probably need to study the materials you have already referenced further. One of the evolutions that Mark Harris goes through is to implement a grid-stride loop:
https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
The reduce6 kernel you have excerpted includes a grid-stride loop. For a fixed set of blocks (say, 1024), it can handle an arbitrary array size N. Therefore, it’s easily possible to realize a full reduction that requires only two kernel launches, independent of the block size, the number of blocks, and/or the length N of the array. Only two calls are required to reduce any arbitrary array to a single value. The second call will only require (and should only have) 1 block. The mechanics are already demonstrated in the CUDA reduction sample code you already referenced.
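
A minimal sketch of the grid-stride pattern on its own (not the reduction itself):

__global__ void grid_stride_demo(float *data, size_t n)
{
    // A fixed grid covers an array of any length: each thread starts at its
    // global index and hops by the total number of threads in the grid.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)blockDim.x * gridDim.x)
        data[i] *= 2.0f;
}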

As an aside, you mentioned array sizes of 2^32 to 2^33. An int array of 2^32 is 16GB. Such arrays would need GPUs with the highest memory sizes currently available (16, 24, 32GB).

Good evening, txbob, and thanks for the detailed explanation.

With this information and the highlights on questions 1, 2 and 4, I will read through the material again and get my hands dirty. You are right about question 3; I mostly use but don’t write template functions, so it was more my ignorance of the subject than a CUDA-related question.

I have access to the big mamma cards at work, but not to a development environment, so I have to work at home with a 1080Ti and write code that scales to bigger capacity.

But I will work on the reduction in the following days and update here as needed.
Thanks a lot for your wise advice.

I’ve been working on dropping the reduction into my program and, though the thing compiles, I still haven’t fully understood a few things; these are follow-up questions to the previous ones.

For example, take the allocation of shared memory with the 3rd argument of the kernel launch: your previous explanation says it is the size of the allocation, not the length of the shared array, and that it depends on the block size. Say I declare the shared array as:

extern __shared__ float s_array[];

In slide 27, where Mark Harris shows the switch with multiple kernel launches, his 3rd argument is just “smemSize”. I thought I should launch the kernel with <<< gridSize, blockSize, blockSize * sizeof(float) >>>, as my shared array is of type float and what is passed is not a length but the size of the allocation. Then I came across Robert Crovella’s answer in https://stackoverflow.com/questions/24419822/efficiently-initializing-shared-memory-array-in-cuda?rq=1, and his declarations are:

#define SSIZE 2592
__shared__ float TMshared[SSIZE];

Hmmm… this looks like a length to me, so now I am confused; please correct me if I’m wrong. When shared memory is static, it is a normal array declaration. But if it is dynamic, then the allocation is “blockSize * sizeof(type)” as the 3rd argument (respecting the 48KB limit on most cards)?

And as for the size of “g_odata”: since I have an output array that is as big as the input, I’m pretty sure it has the space to hold intermediate results. However, at the end of the reduction we have:

if (tid == 0) g_odata[blockIdx.x] = sdata[0];

I thought “g_odata” was a 1-element array holding the final sum while sdata held all the block sums, but seeing the index of “g_odata”, now I’m not sure I understood it. And that shuffle-down reduction method, https://devblogs.nvidia.com/faster-parallel-reductions-kepler/, doesn’t seem to get much love.

That’s correct. For a static allocation, it follows ordinary C programming rules. For dynamic allocation, the size specified via the 3rd kernel config parameter is a size in bytes.
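
Side by side, with hypothetical kernels:

__global__ void k_static(void)
{
    __shared__ float s[256];        // static: element count fixed at compile time
    s[threadIdx.x] = 0.0f;
}

__global__ void k_dynamic(void)
{
    extern __shared__ float s[];    // dynamic: unsized in the kernel itself
    s[threadIdx.x] = 0.0f;          // usable length = (3rd config parameter) / sizeof(float)
}

// k_static <<<numBlocks, 256>>>();                       // no 3rd config parameter needed
// k_dynamic<<<numBlocks, 256, 256 * sizeof(float)>>>();  // 3rd config parameter in bytes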

No, that’s not it. I already stated explicitly that they both hold intermediate sums.

The shared memory is an area that the block works in to do various operations. When the block is all done, and has reduced everything to a single number for the data that the block touches, then it stores that single number in one of the locations in g_odata. The reduction normally consists of a kernel launch of a number of blocks. Each block produces a separate intermediate, partial sum. That partial sum, per block, gets stored in g_odata.

You then launch the kernel again, with a single block, to collect all the partial sums together, to combine them into one single number, the final sum, and store that also.

txbob, apologies if I didn’t quite get the message before. I was unsure how to properly launch the kernel, and just found this topic: https://stackoverflow.com/questions/18023287/cuda-how-can-i-run-the-parallel-reduction-code-for-summation-that-is-described

After modifying the reduction to work with floats instead of integers, the function that wraps the calls to the kernel is:

const unsigned int grid_Size = 256, block_Size = 128;
cuda_reduction < block_Size > <<< grid_Size, block_Size, block_Size * sizeof(float) >>> (array_in, array_out, array_length);
cuda_reduction < 1 > <<< 1, block_Size, block_Size * sizeof(float) >>> (array_out, array_out, grid_Size);
cudaDeviceSynchronize();

The second call uses just 1 block (as per your previous explanation, only 2 calls are needed for any array length), and the input array is now the output array, which holds the partial sums from the previous run, so the loop runs up to grid_Size and not array_length anymore.
However, the solution in the referred thread has the input and output arrays swapped in the second call of the kernel:

reduce0<<<totalBlocks, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(input, output, size);
reduce0<<<1, threadsPerBlock, threadsPerBlock*sizeof(int)>>>(output, input, totalBlocks);

If I do that, some elements of the input array are modified, so I assumed it was a typo and kept array_out as both input and output in the second call. In the end I get an output array with as many partial sums as the number of blocks (which is correct), but none of the elements is the final reduction.
As a QC, I load the input array into Excel and do a SUM() on all elements. I don’t find that result anywhere in the output array after the second kernel call. I also had to add a cudaDeviceSynchronize() at the end, or the program crashes if I try to do anything with the arrays immediately after the second kernel call. Do you have any suggestion as to what could be wrong here? I’m definitely close to getting it to work properly.

The cudaDeviceSynchronize() needed after the last kernel call is because you are using managed memory. That is expected.

If you can condense your test case into 50 lines or less, just post a complete test case. I can’t tell what you are actually running vs. what you are extracting from an example. I’m not sure why you have gone back to reduce0; I think it makes sense to use reduce7.

Don’t really know what you are doing.

I am using reduce7, except that I changed the input arrays and sdata to be floats instead of integers. Then I was unsure how to properly launch the kernel, so I found that SO post. It discusses reduce0, but all I took from it were those kernel launch lines.

A complete test case with the reduction running on floats and printing the first grid_Size elements of the output array is:

#include <iostream>
#include <random>

template <unsigned int blockSize>
__global__ void cuda_reduction(float *array_in, float *reduct, size_t array_len)
    {
    extern __shared__ float sdata[];
    size_t  tid        = threadIdx.x,
            gridSize   = blockSize * 2 * gridDim.x,
            i          = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = 0;
    while (i < array_len)
        { sdata[tid] += array_in[i] + array_in[i + blockSize];
        i += gridSize; }
    __syncthreads();
    if (blockSize >= 512)
        { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256)
        { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128)
    { if (tid <  64) sdata[tid] += sdata[tid + 64]; __syncthreads(); }
    if (tid < 32)
        { if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
          if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
          if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
          if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
          if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
          if (blockSize >= 2)  sdata[tid] += sdata[tid + 1]; }
    if (tid == 0) reduct[blockIdx.x] = sdata[0];
    }

int main(void)
    {
    float *array_in, *array_out;
    size_t array_length = 100000;
    const unsigned int grid_Size = 256, block_Size = 128;

    cudaMallocManaged(&array_in, array_length * sizeof(float));
    cudaMallocManaged(&array_out, array_length * sizeof(float));
    // C++ random number generator to fill array_in
    std::random_device rand_dev;
    std::mt19937 gen(rand_dev());
    std::uniform_real_distribution<> dist(-1.0, 1.0);
    for(size_t i = 0; i < array_length; i++) array_in[i] = dist(gen);
    // Calls the reduction kernel
    cuda_reduction < block_Size > <<< grid_Size, block_Size, block_Size * sizeof(float) >>> (array_in, array_out, array_length);
    cuda_reduction < 1 > <<< 1, block_Size, block_Size * sizeof(float) >>> (array_out, array_out, grid_Size);
    cudaDeviceSynchronize();
    // Prints grid_Size elements of array_out
    for(size_t i = 0; i < grid_Size; i++) std::cout << array_out[i] << std::endl;

    cudaFree(array_in);
    cudaFree(array_out);
    return 0;
    }

So, if I print array_in and do a SUM() on the elements in Excel, the result is not among these intermediate sums in array_out after the second kernel.

The reduce6 kernel you are working from, out of the slideware you accessed, has multiple defects in it (it’s quite old). If you search carefully, you can find an updated version that has some fixes, but what would be even easier is just to use the final kernel from the reduction sample code. However, CUDA 9 versions of that have been reworked to use cooperative groups, which would take us down another alley. So, focusing on the code you have:

Here’s one defect:

while (i < array_len)
        { sdata[tid] += array_in[i] + array_in[i + blockSize];

Suppose i < array_len; in fact, suppose i = array_len - 1. That will pass the conditional test, which means the while-loop body can execute with that value of i. So will i + blockSize still be in bounds? I doubt it. OOPS! (Yes, this defect is in the original code you lifted. Stated another way, the original code only works for certain array sizes.)
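
One way to patch it while keeping the two-loads-per-thread scheme (my sketch, not the sample code’s fix) would be to guard the second load:

while (i < array_len)
    {
    sdata[tid] += array_in[i];                  // first load is safe: i < array_len
    if (i + blockSize < array_len)              // guard the possibly out-of-bounds load
        sdata[tid] += array_in[i + blockSize];
    i += gridSize;
    }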

Which brings me to my next point: any time you are having trouble with a CUDA code, you should always do proper CUDA error checking and run your code with cuda-memcheck before asking others for help. Your code is throwing errors detectable by cuda-memcheck. Chasing those errors would eventually lead to the above observation about i in the while loop.

Another problem with the original reduce6 kernel was the use of shared memory in warp-synchronous programming. This is now deprecated, which is why the sample code has been reworked, but an intermediate “fix” is to mark the shared memory pointer as volatile.

Your use of 1 as the template parameter for the 2nd kernel call should have been block_Size; I’m not sure why you chose 1. And yes, you should be flipping buffers on the 2nd kernel call, not using the out buffer for both the in and out data. The reduction code is not designed to work correctly in-place.

The following code produces the expected results for me after the above changes. Differences between the CPU and GPU calculations in the 5th or 6th significant digit I attribute to floating-point order of operations, not to any defect in the code.

$ cat t282.cu
#include <iostream>
#include <random>

template <unsigned int blockSize>
__global__ void cuda_reduction(float *array_in, float *reduct, size_t array_len)
    {
    extern volatile __shared__ float sdata[];
    size_t  tid        = threadIdx.x,
            gridSize   = blockSize * gridDim.x,
            i          = blockIdx.x * blockSize + tid;
    sdata[tid] = 0;
    while (i < array_len)
        { sdata[tid] += array_in[i];
        i += gridSize; }
    __syncthreads();
    if (blockSize >= 512)
        { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256)
        { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128)
        { if (tid <  64) sdata[tid] += sdata[tid + 64]; __syncthreads(); }
    if (tid < 32)
        { if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
          if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
          if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
          if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
          if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
          if (blockSize >= 2)  sdata[tid] += sdata[tid + 1]; }
    if (tid == 0) reduct[blockIdx.x] = sdata[0];
    }

int main(void)
    {
    float *array_in, *array_out;
    size_t array_length = 100000;
    const unsigned int grid_Size = 256, block_Size = 128;

    cudaMallocManaged(&array_in, array_length * sizeof(float));
    cudaMallocManaged(&array_out, array_length * sizeof(float));
    // C++ random number generator to fill array_in
    std::random_device rand_dev;
    std::mt19937 gen(rand_dev());
    std::uniform_real_distribution<> dist(-1.0, 1.0);
    float cpu_sum = 0.0f;
    for(size_t i = 0; i < array_length; i++) {array_in[i] = dist(gen); cpu_sum += array_in[i];}
    // Calls the reduction kernel
    cuda_reduction < block_Size > <<< grid_Size, block_Size, block_Size * sizeof(float) >>> (array_in, array_out, array_length);
    cuda_reduction < block_Size > <<< 1, block_Size, block_Size * sizeof(float) >>> (array_out, array_in, grid_Size);
    cudaDeviceSynchronize();
    // Prints grid_Size elements of array_out
    std::cout << "cpu: " << cpu_sum << std::endl;
    std::cout << "gpu: " << array_in[0] << std::endl;
    cudaFree(array_in);
    cudaFree(array_out);
    return 0;
    }
$ nvcc -arch=sm_52 -std=c++11 -o t282 t282.cu
$ cuda-memcheck ./t282
========= CUDA-MEMCHECK
cpu: 417.379
gpu: 417.38
========= ERROR SUMMARY: 0 errors
$ ./t282
cpu: 202.244
gpu: 202.244
$ ./t282
cpu: 129.313
gpu: 129.313
$ ./t282
cpu: -111.757
gpu: -111.756
$

txbob, spotting this stride going out of bounds even when “i < len” was providential. So was catching the *2 multiplication in the definitions of gridSize and i. What I used as a base was a presentation with a slightly different slide 35 (older, as you say), but I thought it only split the final “tid” tests into another function and didn’t pay attention to the other elements.

Now I have a yellow post-it on my monitor reminding me about nvprof and cuda-memcheck, as the __CUDA_CALL() macro alone doesn’t catch everything.

It runs exactly as it should now; the 1080Ti doesn’t break a sweat reducing gigantic arrays. Thank you once again for your patience in explaining things and for this excellent assistance with the code.

It will catch errors in API calls. To catch errors related to kernel invocations, you would want to use something like the following macro:

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
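
Typical usage is to place it immediately after each kernel launch (hypothetical launch shown):

my_kernel<<<gridSize, blockSize>>>(d_data, n);
CHECK_LAUNCH_ERROR();   // reports both pre-launch config errors and kernel execution errors

Note that the cudaDeviceSynchronize() inside the macro serializes host and device, so you may want to compile it in only for debug builds.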

Thanks, Njuffa.
This one is going to my macro list.