Reduction

Hi all!

I want to port my code from c++ to CUDA and in my code I want to project environment light to spherical harmonics basis and find some coefficients, my main code is something like this:

void InfiniteLighting::projectLightFunc( int lmax, int Nfunc)

{	

	m_coeff = new Color[Nfunc];

	for (int i =0 ; i< Nfunc; i++)

	{

		m_coeff[i] = 0.0;

	}

	

	float theta, phi;

	for (int  i =0; i< m_iHeight ; ++i)

	{

		theta = ((float(i) + 0.5f)/float(m_iHeight)) * M_PI;

		for (int j=0; j < m_iWidth; ++j)

		{

			phi = ((float(j)+0.5f)/float(m_iWidth)) * M_PI * 2.0f;

			Color Le = LightProbeAccess(j + i * m_iWidth);

			for (int l=0; l < lmax ; ++l)

			{

				for(int m = -l; m <= l ; ++m)

				{

					int indx_Coff = l *(l+1) + m;

					m_coeff[indx_Coff] += Le * CalculateSh(l,m,theta,phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth) ;

					

				}

			}

		}

		

	}

}

I know how to remove the two outer loops but for calculating m_coeff I don’t know how to make sure each thread is writing the correct value and how to make the synchronization. I think it’s kind of a reduction process like dot product but i’m not sure about it.

For the start you can try to submit m_iHeight by m_iWidth threads and have each thread do the inner loop. In order to get this line correct m_coeff[indx_Coff] += …; you will have to use the atomicAdd functions. This is the easiest thing to do. This wwill be effective if Nfunc is very large. That would be the first step.

If this works, then the next step would be to define m_coeff array as shared and only copy the results to the global memory at the end of the block. This will be effective if Nfucn is small enough to fit in the shared memory.

What do you mean by "I don’t know … and how to make the synchronization?

Thanks for your explanation! My problem was exactly what atomicAdd can solve when two threads want to increase the same memory address in the same time. I didn’t know about these functions since I’ve just started to learn CUDA. Nfunc is not very large maybe 16 but the number of adds on each array element will be large. You said the performance will be high if Nfunc is very large but what if it’s not very large? and I’m also confused that how should I choose the number of blocks and threads to be efficient enough?

Hello,

The problem is in a command a=a+b; for 32 times. So the results will be a+32b. Lets take the example of a warp. There are 32 thread. The memory read is per warp. Lets assume a=5 before the line. Now a warp issues a command to read a. They all get 5 , they all add b to 5. So instead of having in the end 5+32*b we will only have 5+b. Without atomicAdd the results will not be correct. The atomicAdd functions make sure that if a thread reads some value from a value, no other thread will read that address until the first thread finished the job. That means that each thread will wait for the other to finish. So if you have hundreds of threads try to do atomicAdd on one address they will all be waiting a long time. The solution for this is to declare a shared array, this way only the threads in a block will wait for each other. There is a good example of this in the book CUDA by example in the chapter with histograms. Here it is: http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0 If you do not have access to book, just download the example codes and look for the ones for histograms.

The only thinkg you need to worry is the number of threads per block. These depend on home many resources are used (registers, shared memory) and in Fermi there can not be more than 1024. There is no communication between blocks, so if you choose a specific number of threads per block then the number of blocks has to be large enough so that there are enough threads to do the work. Your problem depends a lot on nfunc. If nfucn is only 16 you will need to use the share memory to store intermediate results. The simplest (fastest to implement) thing is to have m_iHeight x m_iWidth threads, each of it doing the 2 inner loops , with m_coefftemp[nfunc] as a temporary storage, and then have the results written to the gloabl memory at the end of the kernel.

Hi Saghi,

You don’t need to use atomic operations.

If the expression “Le * CalculateSh(l,m,theta,phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth)” is a function of (i, j, l, m), and the expression has no side-effects, then map the indices onto a linear space. That merges all the for-loops into one for-loop of an integer, from which you recompute the values of i, j, l, and m. So, each thread of the CUDA grid corresponds to a unique value of i, j, l, and m. In the kernel, compute the expression, leaving the result in a large array partial_coef[i, j, l, m]. Finally, use reduction to compute the sums.

Alternatively, map (i, j, l) to a CUDA grid of (x, y, z) and compute within the kernel the innermost for-loop of m. Of course, in all solutions, you will need to move the computations of theta, phi, and Le into the kernel.

There are many, many ways to transform the loops, depending on your memory requirements. This will depend on the ranges of i, j, l, and m.

After you have a working solution, you can optimize the performance using different mappings, shared memory, etc.

Ken

Yes. The previous suggestion might be more appropriate. You can submit m_iHeight * m_iWidth * Nfuncn threads. You can retrieve the indices easy, by submitting 3D grids with 3D blocks. After there is still reduction to be done.

For example if one would have to do a sum of N elements with very large N, it would be done something like this:

One would submit N threads wgrouped in nbl (number of blocks) with tbl threads per block. Now each thread in each block would produce one element (though it can be more depending on problem) and put the value in a temporary array temp[tbl]. Now there would be a loop in which at each iteration the element ion the temp array would add ot itself the element tem[i+offset], where offset=tbl/2 at first iteration and then it would get halved every iteration until it gets to 1. This way each block would now give one element and we would have an array. Here there is a good article for reduction http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html also the book I suggested has good examples of reduction.

The threads can be organized in such a way that each thread would give up to 3 indices and each block up to 3 also, so you have 6 indexes you can play arround.

In general one should try to avoid atomicAdd, too many writes to the same address will slow down the code a lot. Instead one should use the shared memory as a temporary storage.

Thank you both for your great explanation, I think I’ve got your point and it can be the best way to port my code to CUDA. I’m now trying to do what you said and see how it goes. :-)

just one more thing that makes me confuse is that I know I’ll have m_iHeight * m_iWidth * Nfuncn threads but does it make difference how many threads per block I submit?

Ok let me ask my question this way : suppose that I have an image with size 2336* 1752 and Nfun = 16 and lmax = 4. Then my blocks and threads will be like this :

int N = m_iWidth * m_iHeight * NFunc;

    dim3 blockD(16,8,16);

    dim3 gridD((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );

Then how can i find the correct index for i, j and l? is the following code correct?

int i = blockDim.x * blockIdx.x + threadIdx.x;

    int j = blockDim.y * blockIdx.y + threadIdx.y;

    int l = threadIdx.z;

sorry if I’m asking basic questions, I’m very new to this concept and afraid of making mistakes.

Hello,

You are on the good path, just a few remarks.

I an not sure if these declarations are correct:

dim3 blockD(16,8,16);

dim3 gridD((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );

I would rather use:

dim3 blockD=dim3(16,8,16);

dim3 gridD=dim3((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );

or just somthing like dim3 blockD; and then blockD.x=16; and so on

See which one works

For Fermi (it can be different for other architectures) blockD.xblockD.yblockD.z<=1024. In your case blockD.xblockD.yblockD.z=2048

Next you are submitting lots of blocks doing duplicate work, because gridD=dim3((blockD.x -1 + N)/blockD.x , ((blockD.y -1 + N)/blockD.y), (blockD.z -1 + N)/blockD.z );

but in the kernel you do not use at all blockIdx.z.

so you can in fact use maybe, if your algorithm permits

dim3 blockD=dim3(16,8,1);

dim3 gridD=dim3((blockD.x -1 + Nx)/blockD.x , (blockD.y -1 + Ny)/blockD.y/, 16 ); // where Nx*Ny =N

and in the kernel

int i = blockDim.x * blockIdx.x + threadIdx.x;

    int j = blockDim.y * blockIdx.y + threadIdx.y;

    int l = blockIdx.z;

2336 * 1752 * 16 * 4 = 261931008. On a 2.x NVIDIA GPU, the maximum number of threads in a block is 1024; on a 1.x GPU, it’s 512. That means the resulting grid size is invalid if you want to represent this with one-dimension. (In C++ AMP, it would be possible, and the whole code including reduction is much easier to understand.)

So, I would represent this with a 3D grid, (i, j) mapped to (x, y) of the grid, and (l, m) mapped to z. The code to spawn the kernel on this space would look like the code shown below. (The following code just illustrates how you would spawn the kernel and how you would extract the indices from the kernel index space.)

#include <iostream>

#include <vector>

const int max_i = 2336;

const int max_j = 1752;

const int max_m = 16;

const int max_l = 4;

__global__ void kernel(bool * found, int total)

{

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    int j = threadIdx.y + blockIdx.y * blockDim.y;

    int z = threadIdx.z;

int l = z % max_l;

    int m = z / max_l;

// check out of bounds.

    if (i >= max_i)

        return;

    if (j >= max_j)

        return;

    if (l >= max_l)

        return;

    if (m >= max_m)

        return;

int p = i * max_j * max_m * max_l 

                    + j * max_m * max_l

                            + m * max_l

                                    + l;

    if (p < total)

        found[p] = true;

}

int main()

{

    int total = max_i * max_j * max_m * max_l;

    bool * found = (bool *) malloc(total);

std::cout << "total = " << total << "\n";

bool * d_found = 0;

    cudaError_t e2 = cudaMalloc(&d_found, sizeof(bool) * total);

    if (d_found == 0)

        return 1;

    if (e2 != 0)

        return 1;

dim3 threads(16, 16);

    int dim_x = (max_i >> 4) + ( (max_i & 15) ? 1 : 0 );

    int dim_y = (max_i >> 4) + ( (max_i & 15) ? 1 : 0 );

    int dim_z = max_m * max_l;

    dim3 grid(dim_x, dim_y, dim_z);

kernel<<<grid, threads>>>(d_found, total);

    cudaDeviceSynchronize();

    cudaError_t e = cudaGetLastError();

    if (e)

        std::cout << cudaGetErrorString(e) << "\n";

void * ptr = (void*) found;

    cudaMemcpy(ptr, d_found, 256, cudaMemcpyDeviceToHost);

    for (int d = 0; d < 256; ++d)

        std::cout << "d = " << d << " " << found[d] << "\n";

    std::cout << "\n";

    return 0;

}

Note the bounds checking in the kernel. There grid has to be a multiple of the block size (all that “>>4” code does the rounding up). Consequently, you have to include bounds checks to make sure only the proper threads access the data. There’s going to be some unused threads, but that shouldn’t be a problem. BTW, the kernel has returns. That isn’t exactly OK, especially in C++ AMP and probably OpenCL. They should be rewritten without the “return;”. Something to get use to.

Now for the m_coeff array… Since each float is 32-bits, you would have to allocate 4 * 261931008 on the GPU, so the cudaMalloc to represent m_coeff over all (i, j, l, m) would probably fail. Consequently, you may have to do “strip mining”, i.e., do more work per thread, and represent partial sums with a smaller array. In that case, “l” could be mapped to z, and in that kernel, do a for-loop over “m”. If that still is too big a space, then you could map “m” to z, and do a for-loop over “l”. As I said before, there is many ways to do this. (BTW, always check return codes from CUDA runtime calls.)

Ken

Tnx you both! now it’s more clear for me… trying to implement and compile the code to see if it’s working or not. anyway Thanks a lot for your helps! :-)

I see a sinf() call in the innermost loop whose argument is of the form pi * expression. In such cases, substituting sinpif() is recommended (see also the Best Practices Guide). In other words, this code:

theta = ((float(i) + 0.5f)/float(m_iHeight)) * M_PI;

sinf(theta)

would turn into code like this

ratio = ((float(i) + 0.5f)/float(m_iHeight));

theta = ratio * M_PI;

sinpif (ratio);

Did you try to use this from the programming guide:

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,

float* result)

{

// Each block sums a subset of the input array

float partialSum = calculatePartialSum(array, N);

if (threadIdx.x == 0) {

// Thread 0 of each block stores the partial sum

// to global memory

result[blockIdx.x] = partialSum;

// Thread 0 makes sure its result is visible to

// all other threads

__threadfence();

// Thread 0 of each block signals that it is done

unsigned int value = atomicInc(&count, gridDim.x);

// Thread 0 of each block determines if its block is

// the last block to be done

isLastBlockDone = (value == (gridDim.x - 1));

}

// Synchronize to make sure that each thread reads

// the correct value of isLastBlockDone

__syncthreads();

if (isLastBlockDone) {

// The last block sums the partial sums

// stored in result[0 .. gridDim.x-1]

float totalSum = calculateTotalSum(result);

if (threadIdx.x == 0) {

// Thread 0 of last block stores total sum

// to global memory and resets count so that

// next kernel call works properly

result[0] = totalSum;

count = 0;

}

This code is for reduction like sum. Each block makes the partial sum writes the results to the global memory and the last block to be executed makes the final summation.

Well now I’m stuck with this part, apparently I can’t use 3D grid in Gforce 9800 GT and this is how I’m defining my grid size :

int Nx = width, Ny = height, Nz = 16;

dim3 blockD(16, 8, 1);

        dim3 gridD((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y, Nz);

and when I launch my kernel I get this error : invalid configuration argument.

As I’ve searched the error it seems that the grid or block size is not correct. Does anyone know about this error?

Unfortunately, your device is very old. You are getting an invalid config because there is no z-dimension in a grid for 1.x devices. In the CUDA Programmer Guide, check out Table A-1 and Table F-2. I’d recommend you set the block dimensions to (4, 4, 16) and the grid to ((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y). Then, in your kernel, make sure to compute the (i, j, l, m) I showed before. That should work. I have a 9800 also, and that gets past the invalid config problem. --Ken

yeah you were right! that solved the problem! tnx a lot! External Image

Hi again!

I’m still struggling with my cuda code and this is what I’ve got so far:

#include "cuda.h"

#include "cuda_runtime_api.h"

#include <stdio.h>

#include <iostream>

#include "SphericalHarmonics.h"

#include "CudaErr.h"

#include "Color.h"

__device__ SH _sphericalH;

__global__ void projectLightFunc(int lmax, int Nfunc, int _width, int _height, Color * image, Color* coeff) {

__shared__ Color cache_test[8*8*16];

int j = blockDim.x * blockIdx.x + threadIdx.x;

    int i = blockDim.y * blockIdx.y + threadIdx.y;

    int offset = j + i *_width;

int indx = threadIdx.x * 8 + threadIdx.y;

if (i >= _height)

        return;

    if (j >= _width)

        return;

float theta, phi;

    float ratio = ((float(i) + 0.5f) / float(_height));

    theta = ratio * M_PI;

    phi = ((float(j) + 0.5f) / float(_width)) * M_PI * 2.0f;

Color Le = image[offset];

for (int l = 0; l < lmax; l++)

        for (int m = -l; m <= l; m++) {

            int indx_Coeff = l * (l + 1) + m;

cache_test[indx * Nfunc + indx_Coeff] =  Le * _sphericalH.CalculateSh(l, m, theta, phi) * sinpif(ratio) * (M_PI / _height) * (2.0f * M_PI / _width);

}

__syncthreads();

    int z = 8*8 / 2;

while( z !=0) {

        if(indx < z)

        {

            for(int j = 0; j < Nfunc ; j++)

                cache_test[indx * Nfunc + j] += cache_test[(indx + z) * Nfunc + j];

}

         __syncthreads();

         z/=2;

    }

if(indx == 0)

    {

for(int i=0; i< Nfunc ; i++ )

         {

                coeff[i] += cache_test[i];

}

}

}

extern "C" {

Color* StartLightProjection(int Nfunc, int width, int height, Color *image) {

int Nx = width, Ny = height;

        int lmax = sqrt(Nfunc);

Color  *dev_Image, *dev_coeff, *coeff ;

dim3 blockD(8, 8);

        dim3 gridD((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y);

std::cout << " gridD : <" << gridD.x << "," << gridD.y << "," << gridD.z << ">" << std::endl;

        std::cout << " image size : " << width << " * " << height << std::endl;

        std::cout << " image memory needed : " << width * height * Nfunc * sizeof (Color) << std::endl;

        std::cout << " lmax : " << lmax << std::endl;

coeff = (Color*) malloc(Nfunc * sizeof (Color));

for(int i=0 ; i < Nfunc ; i++)

            coeff[i] = 0.0f;

CudaSafeCall(cudaMalloc((void**) &dev_coeff,

                Nfunc * sizeof (Color)));

        CudaSafeCall(cudaMalloc((void**) &dev_Image,

                width * height * sizeof (Color)));

CudaSafeCall(cudaMemcpy(dev_Image, image, width * height * sizeof (Color),

                cudaMemcpyHostToDevice));

        CudaSafeCall(cudaMemcpy(dev_coeff, coeff, Nfunc * sizeof (Color),

                cudaMemcpyHostToDevice));

projectLightFunc <<< gridD, blockD >>>(lmax, Nfunc, width, height, dev_Image, dev_coeff);

CudaCheckError();

CudaSafeCall(cudaMemcpy(coeff, dev_coeff, Nfunc * sizeof (Color),

                cudaMemcpyDeviceToHost));

cudaFree(dev_Image);

       cudaFree(dev_coeff);

return coeff;

}

}

The problem that I have is that each time I run my code I get a different answer which are close to each other but still not stable and completely correct. if i do the reduction on CPU and instead of shared memory use a large global memory then everything works fine, but i have to improve my timing as much as possible! do you have any idea? also I’m not sure if I use texture memory and upload my image to it, it gets faster or not External Image

thanks in advance for all your helps!

Hello,

The problrem is in this part:

if(indx == 0)

    {

for(int i=0; i< Nfunc ; i++ )

         {

                coeff[i] += cache_test[i];

}

}

In that place you should use atomic add or save cache_test[i] to some array in global memory and use 2 kernels. For production do not use just a block of (8,8). More threads might give better performance.

Just a suggestion for the last code:

if(indx == 0)

    {

     Color temp=0;

        for(int i=0; i< Nfunc ; i++ )

         {

                temp += cache_test[i];

}

         atomicAdd(&coeff[i],temp)

    }

Hi, sorry for my late reply. I’ve a problem with atomicAdd and whenever I use it in my program it gives me this error:

error: identifier “atomicAdd” is undefined

is this because my GPU is not supporting these functions or is this something else?

Hello,

The atomic functions are supported only on 1.1 or higher compute capability. So first you need to have a 1.1 cc gpu and second you need to specify the architecture when compiling by adding flag -arch=sm_11 (replace 11 with higher if you have higher, you can have 12 13 20 21), otherwise the code is automatically produced for running on 1.0 cc.