I want to port my code from c++ to CUDA and in my code I want to project environment light to spherical harmonics basis and find some coefficients, my main code is something like this:

[code]
void InfiniteLighting::projectLightFunc(int lmax, int Nfunc)
{
    m_coeff = new Color[Nfunc];
    for (int i = 0; i < Nfunc; i++)
    {
        m_coeff[i] = 0.0;
    }

    float theta, phi;
    for (int i = 0; i < m_iHeight; ++i)
    {
        theta = ((float(i) + 0.5f) / float(m_iHeight)) * M_PI;

        for (int j = 0; j < m_iWidth; ++j)
        {
            phi = ((float(j) + 0.5f) / float(m_iWidth)) * M_PI * 2.0f;
            Color Le = LightProbeAccess(j + i * m_iWidth);
            for (int l = 0; l < lmax; ++l)
            {
                for (int m = -l; m <= l; ++m)
                {
                    int indx_Coff = l * (l + 1) + m;
                    m_coeff[indx_Coff] += Le * CalculateSh(l, m, theta, phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth);
                }
            }
        }
    }
}
[/code]
I know how to remove the two outer loops but for calculating m_coeff I don't know how to make sure each thread is writing the correct value and how to make the synchronization. I think it's kind of a reduction process like dot product but i'm not sure about it.

For a start you can try to submit m_iHeight by m_iWidth threads and have each thread do the inner loops. In order to get the line m_coeff[indx_Coff] += ....; correct, you will have to use the atomicAdd functions. This is the easiest thing to do, and it will be effective if Nfunc is very large. That would be the first step.

If this works, then the next step would be to declare the m_coeff array as shared and only copy the results to global memory at the end of the block. This will be effective if Nfunc is small enough to fit in the shared memory.

What do you mean by "I don't know ... how to make the synchronization"?
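As a rough sketch of that first step (hypothetical names throughout: a single float channel instead of Color, and CalculateSh assumed to be available as a __device__ function), the per-pixel kernel with global atomics might look like:

[code]
// One thread per pixel; each thread runs the (l, m) loops and
// accumulates into the global coefficient array with atomicAdd.
__global__ void projectKernel(const float *light, float *coeff,
                              int width, int height, int lmax)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (i >= height || j >= width)
        return;

    float theta  = ((i + 0.5f) / height) * M_PI;
    float phi    = ((j + 0.5f) / width) * 2.0f * M_PI;
    float Le     = light[j + i * width];
    float weight = sinf(theta) * (M_PI / height) * (2.0f * M_PI / width);

    for (int l = 0; l < lmax; ++l)
        for (int m = -l; m <= l; ++m)
        {
            int idx = l * (l + 1) + m;
            // Many threads update the same idx, hence the atomic.
            atomicAdd(&coeff[idx], Le * CalculateSh(l, m, theta, phi) * weight);
        }
}
[/code]

Note that atomicAdd on float requires compute capability 2.0 or later; on older devices only integer atomics are available.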

[quote name='pasoleatis' date='04 April 2012 - 12:16 PM' timestamp='1333538180' post='1391847']
For a start you can try to submit m_iHeight by m_iWidth threads and have each thread do the inner loops. In order to get the line m_coeff[indx_Coff] += ....; correct, you will have to use the atomicAdd functions. This is the easiest thing to do, and it will be effective if Nfunc is very large. That would be the first step.

If this works, then the next step would be to declare the m_coeff array as shared and only copy the results to global memory at the end of the block. This will be effective if Nfunc is small enough to fit in the shared memory.

What do you mean by "I don't know ... how to make the synchronization"?
[/quote]

Thanks for your explanation! My problem was exactly what atomicAdd solves: two threads wanting to increment the same memory address at the same time. I didn't know about these functions since I've just started to learn CUDA. Nfunc is not very large, maybe 16, but the number of adds on each array element will be large. You said the performance will be good if Nfunc is very large, but what if it's not? I'm also confused about how I should choose the number of blocks and threads to be efficient enough.

[quote name='Saghi' date='04 April 2012 - 01:30 PM' timestamp='1333542621' post='1391867']
Thanks for your explanation! My problem was exactly what atomicAdd solves: two threads wanting to increment the same memory address at the same time. I didn't know about these functions since I've just started to learn CUDA. Nfunc is not very large, maybe 16, but the number of adds on each array element will be large. You said the performance will be good if Nfunc is very large, but what if it's not? I'm also confused about how I should choose the number of blocks and threads to be efficient enough.
[/quote]
Hello,

The problem is with a command like a = a + b; executed by 32 threads: the result should be a + 32*b. Take the example of a warp. There are 32 threads, and memory reads are issued per warp. Assume a = 5 before the line. Now the warp issues a command to read a; all threads get 5, and they all add b to 5. So instead of ending up with 5 + 32*b we end up with only 5 + b. Without atomicAdd the result will not be correct. The atomicAdd functions make sure that if a thread reads a value from an address, no other thread will read that address until the first thread has finished the job. That means each thread waits for the others to finish, so if you have hundreds of threads trying to do atomicAdd on one address, they will all be waiting a long time. The solution for this is to declare a shared array; this way only the threads in a block wait for each other. There is a good example of this in the book CUDA by Example, in the chapter on histograms. Here it is: http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0 If you do not have access to the book, just download the example codes and look for the ones for histograms.

The only thing you need to worry about is the number of threads per block. This depends on how many resources are used (registers, shared memory), and on Fermi there cannot be more than 1024. There is no communication between blocks, so if you choose a specific number of threads per block, then the number of blocks has to be large enough that there are enough threads to do the work. Your problem depends a lot on Nfunc. If Nfunc is only 16, you will need to use the shared memory to store intermediate results. The simplest (fastest to implement) thing is to have m_iHeight x m_iWidth threads, each of them doing the 2 inner loops, with m_coefftemp[Nfunc] as temporary storage, and then have the results written to the global memory at the end of the kernel.
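A sketch of that shared-memory variant (hypothetical names, one float channel, and a compute capability 2.0+ device assumed for the float atomics) could look like:

[code]
// Per-block accumulation in shared memory; only nfunc global atomics
// per block at the end instead of one per (pixel, coefficient) pair.
__global__ void projectKernelShared(const float *light, float *coeff,
                                    int width, int height, int lmax, int nfunc)
{
    extern __shared__ float s_coeff[]; // nfunc floats per block

    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int nthreads = blockDim.x * blockDim.y;

    // Zero the per-block accumulator.
    for (int k = tid; k < nfunc; k += nthreads)
        s_coeff[k] = 0.0f;
    __syncthreads();

    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < height && j < width) {
        float theta = ((i + 0.5f) / height) * M_PI;
        float phi   = ((j + 0.5f) / width) * 2.0f * M_PI;
        float Le    = light[j + i * width];
        float w     = sinf(theta) * (M_PI / height) * (2.0f * M_PI / width);
        for (int l = 0; l < lmax; ++l)
            for (int m = -l; m <= l; ++m)
                atomicAdd(&s_coeff[l * (l + 1) + m],
                          Le * CalculateSh(l, m, theta, phi) * w);
    }
    __syncthreads();

    // One atomic per coefficient per block into global memory.
    for (int k = tid; k < nfunc; k += nthreads)
        atomicAdd(&coeff[k], s_coeff[k]);
}
[/code]

Out-of-bounds threads skip the accumulation but still reach both __syncthreads() calls, which every thread in the block must do.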

Hi Saghi,

You don't need to use atomic operations.

If the expression "Le * CalculateSh(l,m,theta,phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth)" is a function of (i, j, l, m), and the expression has no side effects, then map the indices onto a linear space. That merges all the for-loops into one for-loop over an integer, from which you recompute the values of i, j, l, and m. So, each thread of the CUDA grid corresponds to a unique value of i, j, l, and m. In the kernel, compute the expression, leaving the result in a large array partial_coef[i, j, l, m]. Finally, use reduction to compute the sums.

Alternatively, map (i, j, l) to a CUDA grid of (x, y, z) and compute within the kernel the innermost for-loop of m. Of course, in all solutions, you will need to move the computations of theta, phi, and Le into the kernel.

There are many, many ways to transform the loops, depending on your memory requirements. This will depend on the ranges of i, j, l, and m.

After you have a working solution, you can optimize the performance using different mappings, shared memory, etc.
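The index mapping described above can be checked on the host before writing the kernel. A small sketch (plain C++, with made-up loop bounds) that flattens (i, j) and the packed (l, m) slot l*(l+1)+m into one linear index and recovers them:

[code]
#include <cassert>

int main()
{
    const int H = 4, W = 6, lmax = 3;
    const int nfunc = lmax * lmax; // packed index l*(l+1)+m runs over [0, lmax^2)

    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            for (int k = 0; k < nfunc; ++k)
            {
                int t = (i * W + j) * nfunc + k; // one linear thread index

                int rk = t % nfunc;              // recovered (l, m) slot
                int rj = (t / nfunc) % W;        // recovered column
                int ri = t / (nfunc * W);        // recovered row
                assert(ri == i && rj == j && rk == k);

                // Unpack l and m from the packed slot: k is in [l^2, (l+1)^2).
                int l = 0;
                while ((l + 1) * (l + 1) <= rk) ++l;
                int m = rk - l * (l + 1);
                assert(m >= -l && m <= l);
            }
    return 0;
}
[/code]

The same arithmetic, done with / and % inside the kernel, gives each thread its unique (i, j, l, m).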

[quote name='Ken Domino' date='04 April 2012 - 09:42 PM' timestamp='1333572166' post='1392048']
Hi Saghi,

You don't need to use atomic operations.

If the expression "Le * CalculateSh(l,m,theta,phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth)" is a function of (i, j, l, m), and the expression has no side-effects, then map the indices onto a linear space. That merges all the for-loops into one for-loop of an integer, from which you recompute the values of i, j, l, and m. So, each thread of the CUDA grid corresponds to a unique value of i, j, l, and m. In the kernel, compute the expression, leaving the result in a large array partial_coef[i, j, l, m]. Finally, use reduction to compute the sums.

Alternatively, map (i, j, l) to a CUDA grid of (x, y, z) and compute within the kernel the innermost for-loop of m. Of course, in all solutions, you will need to move the computations of theta, phi, and Le into the kernel.

There are many, many ways to transform the loops, depending on your memory requirements. This will depend on the ranges of i, j, l, and m.

After you have a working solution, you can optimize the performance using different mappings, shared memory, etc.

Ken
[/quote]

Yes. The previous suggestion might be more appropriate. You can submit m_iHeight * m_iWidth * Nfunc threads. You can retrieve the indices easily by submitting 3D grids with 3D blocks. Afterwards there is still a reduction to be done.

For example if one would have to do a sum of N elements with very large N, it would be done something like this:

One would submit N threads grouped into nbl blocks with tbl threads per block. Each thread in each block produces one element (though it can be more, depending on the problem) and puts the value in a temporary array temp[tbl]. Then there is a loop in which, at each iteration, element i of the temp array adds to itself the element temp[i+offset], where offset = tbl/2 at the first iteration and then gets halved every iteration until it reaches 1. This way each block gives one element, and we end up with an array of per-block results. Here is a good article on reduction: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html and the book I suggested also has good examples of reduction.

The threads can be organized in such a way that each thread provides up to 3 indices and each block up to 3 as well, so you have 6 indices to play around with.

In general one should try to avoid atomicAdd; too many writes to the same address will slow down the code a lot. Instead one should use the shared memory as temporary storage.
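The offset-halving loop described above, as a rough per-block kernel sketch (hypothetical names; blockDim.x assumed to be a power of two):

[code]
// Each block sums a chunk of the input; blockResults gets one partial
// sum per block, which a second, smaller pass reduces to a single value.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float temp[]; // blockDim.x floats

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    temp[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the active range each iteration: offset = tbl/2, tbl/4, ..., 1.
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (threadIdx.x < offset)
            temp[threadIdx.x] += temp[threadIdx.x + offset];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = temp[0];
}
[/code]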

Thank you both for your great explanations! I think I've got your point, and it seems like the best way to port my code to CUDA. I'm now trying to do what you said to see how it goes. :-)

Just one more thing that confuses me: I know I'll have m_iHeight * m_iWidth * Nfunc threads, but does it make a difference how many threads per block I submit?

OK, let me ask my question this way: suppose I have an image of size 2336 * 1752, Nfunc = 16, and lmax = 4. Then my blocks and threads will be like this:

I would rather use:
[code]
dim3 blockD=dim3(16,8,16);
dim3 gridD=dim3((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );
[/code]
or just declare something like dim3 blockD; and then set blockD.x = 16; and so on.

See which one works

For Fermi (it can be different for other architectures) blockD.x*blockD.y*blockD.z <= 1024. In your case blockD.x*blockD.y*blockD.z = 16*8*16 = 2048.

Next you are submitting lots of blocks doing duplicate work, because gridD=dim3((blockD.x -1 + N)/blockD.x , ((blockD.y -1 + N)/blockD.y), (blockD.z -1 + N)/blockD.z );

but in the kernel you do not use at all blockIdx.z.

so you can in fact use, if your algorithm permits, something like

[code]
dim3 blockD=dim3(16,8,1);
dim3 gridD=dim3((blockD.x -1 + Nx)/blockD.x , (blockD.y -1 + Ny)/blockD.y , 16 ); // where Nx*Ny = N
[/code]
and in the kernel
[code]
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
int l = blockIdx.z;
[/code]
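The ceiling-division pattern in these configurations can be sanity-checked on the host. A small sketch (plain C++, using the image size from the question) verifying that the grid covers every pixel and that a block stays within the thread limit:

[code]
#include <cassert>

// Ceiling division: smallest number of blocks covering n elements.
static int ceilDiv(int n, int block) { return (block - 1 + n) / block; }

int main()
{
    const int Nx = 2336, Ny = 1752;

    // (16, 8, 16) exceeds the Fermi per-block limit: 16*8*16 = 2048 > 1024.
    assert(16 * 8 * 16 > 1024);

    // (16, 8, 1) is fine, with the third problem dimension moved to the grid.
    const int bx = 16, by = 8, bz = 1;
    assert(bx * by * bz <= 1024);

    int gx = ceilDiv(Nx, bx), gy = ceilDiv(Ny, by);
    assert(gx * bx >= Nx && gy * by >= Ny);           // grid covers the image
    assert((gx - 1) * bx < Nx && (gy - 1) * by < Ny); // with no whole spare block
    return 0;
}
[/code]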

[quote name='Saghi' date='05 April 2012 - 09:29 AM' timestamp='1333632552' post='1392266']
Ok let me ask my question this way : suppose that I have an image with size 2336* 1752 and Nfun = 16 and lmax = 4.
[/quote]

2336 * 1752 * 16 * 4 = 261931008. On a 2.x NVIDIA GPU, the maximum number of threads in a block is 1024; on a 1.x GPU, it's 512. That means the resulting grid size is invalid if you want to represent this in one dimension. (In C++ AMP it would be possible, and the whole code, including the reduction, is much easier to understand.)

So, I would represent this with a 3D grid, (i, j) mapped to (x, y) of the grid, and (l, m) mapped to z. The code to spawn the kernel on this space would look like the code shown below. (The following code just illustrates how you would spawn the kernel and how you would extract the indices from the kernel index space.)

[code]
#include <iostream>
#include <cstdlib>

const int max_i = 2336;
const int max_j = 1752;
const int max_m = 16;
const int max_l = 4;

__global__ void kernel(bool * found, int total)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z;

    int l = z % max_l;
    int m = z / max_l;

    // check out of bounds.
    if (i >= max_i)
        return;
    if (j >= max_j)
        return;
    if (l >= max_l)
        return;
    if (m >= max_m)
        return;

    int p = i * max_j * max_m * max_l
          + j * max_m * max_l
          + m * max_l
          + l;
    if (p < total)
        found[p] = true;
}

int main()
{
    int total = max_i * max_j * max_m * max_l;
    bool * found = (bool *) malloc(total);

    bool * d_found;
    cudaMalloc((void **) &d_found, total);
    cudaMemset(d_found, 0, total);

    // The (l, m) pairs live in threadIdx.z: blockDim.z = max_l * max_m = 64,
    // so 4 * 4 * 64 = 1024 threads per block; the grid is rounded up.
    dim3 threads(4, 4, max_l * max_m);
    dim3 grid((max_i + threads.x - 1) / threads.x,
              (max_j + threads.y - 1) / threads.y);

    kernel<<<grid, threads>>>(d_found, total);
    cudaDeviceSynchronize();
    cudaError_t e = cudaGetLastError();
    if (e)
        std::cout << cudaGetErrorString(e) << "\n";

    cudaMemcpy(found, d_found, 256, cudaMemcpyDeviceToHost);
    for (int d = 0; d < 256; ++d)
        std::cout << "d = " << d << " " << found[d] << "\n";
    std::cout << "\n";
    return 0;
}
[/code]

Note the bounds checking in the kernel. The grid has to cover a multiple of the block size (the ceiling division when computing the grid does the rounding up). Consequently, you have to include bounds checks to make sure only the proper threads access the data. There are going to be some unused threads, but that shouldn't be a problem. BTW, the kernel has returns. That isn't exactly OK, especially in C++ AMP and probably OpenCL; they should be rewritten without the "return;". Something to get used to.

Now for the m_coeff[] array... Since each float is 32 bits, you would have to allocate 4 * 261931008 bytes on the GPU, so the cudaMalloc to represent m_coeff over all (i, j, l, m) would probably fail. Consequently, you may have to do "strip mining", i.e., do more work per thread and represent partial sums with a smaller array. In that case, "l" could be mapped to z, and in that kernel, do a for-loop over "m". If that is still too big a space, then you could map "m" to z and do a for-loop over "l". As I said before, there are many ways to do this. (BTW, always check return codes from CUDA runtime calls.)
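The footprint arithmetic behind that advice can be sketched on the host (plain C++; the sizes are the ones from the question):

[code]
#include <cassert>
#include <cstdint>

int main()
{
    const std::int64_t w = 2336, h = 1752, nfunc = 16, lmax = 4;

    // One float per (i, j, l, m) combination: about 1 GB, likely too
    // large to cudaMalloc on an older GPU.
    std::int64_t full = w * h * nfunc * lmax * sizeof(float);
    assert(full == 1047724032LL);

    // Strip-mined: map l to grid z and loop over m inside the kernel,
    // shrinking the partial-sum array by a factor of nfunc (~65 MB).
    std::int64_t strip = w * h * lmax * sizeof(float);
    assert(strip == full / nfunc);
    return 0;
}
[/code]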

Thanks to you both! It's much clearer for me now... trying to implement and compile the code to see whether it works. Anyway, thanks a lot for your help! :-)

I see a sinf() call in the innermost loop whose argument is of the form pi * expression. In such cases, substituting sinpif() is recommended (see also the Best Practices Guide). In other words, this code:

[code]
theta = ((float(i) + 0.5f)/float(m_iHeight)) * M_PI;
sinf(theta)
[/code]
would turn into code like this

[code]
ratio = ((float(i) + 0.5f)/float(m_iHeight));
theta = ratio * M_PI;
sinpif(ratio);
[/code]

Did you try to use this from the programming guide:

[quote]
In general, when a thread issues a series of writes to memory in a particular order,
other threads may see the effects of these memory writes in a different order.
__threadfence_block(), __threadfence(), and
__threadfence_system() can be used to enforce some ordering.
One use case is when threads consume some data produced by other threads as
illustrated by the following code sample of a kernel that computes the sum of an
array of N numbers in one call. Each block first sums a subset of the array and
stores the result in global memory. When all blocks are done, the last block done
reads each of these partial sums from global memory and sums them to obtain the
final result. In order to determine which block is finished last, each block atomically
increments a counter to signal that it is done with computing and storing its partial
sum (see Section B.11 about atomic functions). The last block is the one that
receives the counter value equal to gridDim.x-1. If no fence is placed between
storing the partial sum and incrementing the counter, the counter might increment
before the partial sum is stored and therefore, might reach gridDim.x-1 and let
the last block start reading partial sums before they have been actually updated in
memory.
[/quote]
[code]
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
float* result)
{
// Each block sums a subset of the input array
float partialSum = calculatePartialSum(array, N);
if (threadIdx.x == 0) {
// Thread 0 of each block stores the partial sum
// to global memory
result[blockIdx.x] = partialSum;
// Thread 0 makes sure its result is visible to
// all other threads
__threadfence();
// Thread 0 of each block signals that it is done
unsigned int value = atomicInc(&count, gridDim.x);
// Thread 0 of each block determines if its block is
// the last block to be done
isLastBlockDone = (value == (gridDim.x - 1));
}
// Synchronize to make sure that each thread reads
// the correct value of isLastBlockDone
__syncthreads();
if (isLastBlockDone) {
// The last block sums the partial sums
// stored in result[0 .. gridDim.x-1]

float totalSum = calculateTotalSum(result);
if (threadIdx.x == 0) {
// Thread 0 of last block stores total sum
// to global memory and resets count so that
// next kernel call works properly
result[0] = totalSum;
count = 0;
}
}
}
[/code]

This code is for a reduction like a sum: each block computes a partial sum and writes the result to global memory, and the last block to finish performs the final summation.

Well, now I'm stuck with this part. Apparently I can't use a 3D grid on a GeForce 9800 GT. This is how I'm defining my grid size:
[code]
int Nx = width, Ny = height, Nz = 16;
[/code]
and when I launch my kernel I get this error: invalid configuration argument.

As I've searched, the error seems to mean that the grid or block size is not correct. Does anyone know about this error?

[quote name='Saghi' date='11 April 2012 - 01:33 PM' timestamp='1334165607' post='1394914']
Well now I'm stuck with this part, apparently I can't use 3D grid in Gforce 9800 GT and this is how I'm defining my grid size :
[code]
int Nx = width, Ny = height, Nz = 16;
[/code]

and when I launch my kernel I get this error : invalid configuration argument.

As I've searched the error it seems that the grid or block size is not correct. Does anyone know about this error?
[/quote]

Unfortunately, your device is very old. You are getting an invalid config because there is no z-dimension in a grid for 1.x devices. In the CUDA Programmer Guide, check out Table A-1 and Table F-2. I'd recommend you set the block dimensions to (4, 4, 16) and the grid to ((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y). Then, in your kernel, make sure to compute the (i, j, l, m) I showed before. That should work. I have a 9800 also, and that gets past the invalid config problem. --Ken
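The suggested configuration can be checked with a few lines of host arithmetic (plain C++; Nx and Ny as in the earlier posts):

[code]
#include <cassert>

int main()
{
    const int Nx = 2336, Ny = 1752;

    // Block (4, 4, 16): the 16 (l, m) slots live in threadIdx.z,
    // so the grid itself stays two-dimensional, as 1.x devices require.
    const int bx = 4, by = 4, bz = 16;
    assert(bx * by * bz <= 512); // 256 threads: within the 1.x limit

    int gx = (bx - 1 + Nx) / bx;
    int gy = (by - 1 + Ny) / by;
    assert(gx * bx >= Nx && gy * by >= Ny); // covers the whole image
    return 0;
}
[/code]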

and when I launch my kernel I get this error : invalid configuration argument.

As I've searched the error it seems that the grid or block size is not correct. Does anyone know about this error?

Unfortunately, your device is very old. You are getting an invalid config because there is no z-dimension in a grid for 1.x devices. In the CUDA Programmer Guide, check out Table A-1 and Table F-2. I'd recommend you set the block dimensions to (4, 4, 16) and the grid to ((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y). Then, in your kernel, make sure to compute the (i, j, l, m) I showed before. That should work. I have a 9800 also, and that gets past the invalid config problem. --Ken


What do you mean by "I don't know ... how to make the synchronization"?


Thanks for your explanation! My problem was exactly what atomicAdd can solve: two threads wanting to increase the same memory address at the same time. I didn't know about these functions since I've just started to learn CUDA. Nfunc is not very large, maybe 16, but the number of adds on each array element will be large. You said the performance will be high if Nfunc is very large, but what if it's not? I'm also confused about how I should choose the number of blocks and threads to be efficient enough.


Hello,

The problem is with a command like a = a + b; executed by 32 threads, where the result should be a + 32*b. Take the example of a warp: there are 32 threads, and memory reads happen per warp. Assume a = 5 before the line. The warp issues a command to read a; all threads get 5, and all of them add b to 5. So instead of ending with 5 + 32*b we end with only 5 + b. Without atomicAdd the results will not be correct. The atomicAdd functions make sure that if a thread reads a value from an address, no other thread can access that address until the first thread has finished its job. That means each thread will wait for the others to finish, so if you have hundreds of threads trying to do atomicAdd on one address, they will all be waiting a long time. The solution is to declare a shared array; this way only the threads within a block wait for each other. There is a good example of this in the book CUDA by Example, in the chapter on histograms. Here it is: http://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0 If you do not have access to the book, just download the example codes and look for the ones for histograms.

The only thing you need to worry about is the number of threads per block. This depends on how many resources are used (registers, shared memory), and on Fermi there cannot be more than 1024. There is no communication between blocks, so once you choose a specific number of threads per block, the number of blocks has to be large enough that there are enough threads to do the work. Your problem depends a lot on Nfunc. If Nfunc is only 16, you will need to use the shared memory to store intermediate results. The simplest (fastest to implement) thing is to have m_iHeight x m_iWidth threads, each doing the 2 inner loops, with m_coefftemp[Nfunc] as temporary storage, and then have the results written to the global memory at the end of the kernel.
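A minimal sketch of what that shared-memory variant could look like. Several things here are assumptions, not the original code: a single-channel float image instead of the Color type, a __device__ CalculateSh matching the original signature, lmax = 4 so Nfunc = 16, and a device with float atomicAdd support (compute capability 2.0+; a 1.x card like the 9800 GT has no float atomics, so there you would fall back to the per-thread accumulation discussed later in the thread):

```cuda
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979f
#endif

#define NFUNC 16  // assumes lmax = 4 -> lmax*lmax = 16 coefficients

// Assumed to exist elsewhere as a __device__ function, like in the original code.
__device__ float CalculateSh(int l, int m, float theta, float phi);

__global__ void projectSH(const float* image, int width, int height, float* coeff)
{
    // Per-block partial coefficients, the m_coefftemp[Nfunc] idea.
    __shared__ float s_coeff[NFUNC];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < NFUNC) s_coeff[tid] = 0.0f;
    __syncthreads();

    int i = blockIdx.y * blockDim.y + threadIdx.y; // image row
    int j = blockIdx.x * blockDim.x + threadIdx.x; // image column
    if (i < height && j < width) {
        float theta  = ((i + 0.5f) / height) * M_PI;
        float phi    = ((j + 0.5f) / width) * 2.0f * M_PI;
        float Le     = image[j + i * width];
        float dOmega = sinf(theta) * (M_PI / height) * (2.0f * M_PI / width);
        for (int l = 0; l < 4; ++l)
            for (int m = -l; m <= l; ++m)
                atomicAdd(&s_coeff[l * (l + 1) + m],
                          Le * CalculateSh(l, m, theta, phi) * dOmega);
    }
    __syncthreads();

    // One atomicAdd to global memory per coefficient per block,
    // instead of one per pixel.
    if (tid < NFUNC) atomicAdd(&coeff[tid], s_coeff[tid]);
}
```

The point of the design is that contention on each shared-memory cell is limited to one block's threads, and the expensive global atomics shrink from width*height*Nfunc to gridDim.x*gridDim.y*Nfunc.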


You don't need to use atomic operations.

If the expression "Le * CalculateSh(l,m,theta,phi) * sinf(theta) * (M_PI / m_iHeight) * (2.0f * M_PI / m_iWidth)" is a function of (i, j, l, m), and the expression has no side-effects, then map the indices onto a linear space. That merges all the for-loops into one for-loop of an integer, from which you recompute the values of i, j, l, and m. So, each thread of the CUDA grid corresponds to a unique value of i, j, l, and m. In the kernel, compute the expression, leaving the result in a large array partial_coef[i, j, l, m]. Finally, use reduction to compute the sums.

Alternatively, map (i, j, l) to a CUDA grid of (x, y, z) and compute within the kernel the innermost for-loop of m. Of course, in all solutions, you will need to move the computations of theta, phi, and Le into the kernel.

There are many, many ways to transform the loops, depending on your memory requirements. This will depend on the ranges of i, j, l, and m.

After you have a working solution, you can optimize the performance using different mappings, shared memory, etc.

Ken


Yes. The previous suggestion might be more appropriate. You can submit m_iHeight * m_iWidth * Nfunc threads. You can retrieve the indices easily by submitting 3D grids with 3D blocks. After that there is still a reduction to be done.

For example, if one had to do a sum of N elements with very large N, it would be done something like this:

One would submit N threads grouped in nbl (number of blocks) with tbl threads per block. Each thread in each block produces one element (though it can be more, depending on the problem) and puts the value in a temporary array temp[tbl]. Then there is a loop in which, at each iteration, element i of the temp array adds to itself the element temp[i+offset], where offset = tbl/2 at the first iteration and is then halved every iteration until it reaches 1. This way each block yields one element, and we are left with an array of partial sums. Here is a good article on reduction: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html The book I suggested also has good examples of reduction.

The threads can be organized in such a way that each thread provides up to 3 indices and each block up to 3 as well, so you have 6 indices to play around with.

In general one should try to avoid atomicAdd; too many writes to the same address will slow down the code a lot. Instead, use the shared memory as temporary storage.
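The halving-offset loop described above, sketched as a generic block-wise sum kernel. This is an illustration, not code from the thread; it assumes blockDim.x is a power of two and that a second step (another kernel launch, or the last-block trick quoted later) combines the per-block results:

```cuda
__global__ void blockSum(const float* in, float* result, int N)
{
    extern __shared__ float temp[];          // tbl floats, size passed at launch
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element (zero-padded past the end).
    temp[tid] = (gid < N) ? in[gid] : 0.0f;
    __syncthreads();

    // offset starts at tbl/2 and is halved each iteration until it reaches 1.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tid < offset)
            temp[tid] += temp[tid + offset];
        __syncthreads();
    }

    // One partial sum per block goes to global memory.
    if (tid == 0)
        result[blockIdx.x] = temp[0];
}
// launch sketch: blockSum<<<nbl, tbl, tbl * sizeof(float)>>>(d_in, d_result, N);
```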


Just one more thing that makes me confused: I know I'll have m_iHeight * m_iWidth * Nfunc threads, but does it make a difference how many threads per block I submit?


[code]
int N = m_iWidth * m_iHeight * NFunc;
dim3 blockD(16,8,16);
dim3 gridD((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );
[/code]

Then how can I find the correct index for i, j and l? Is the following code correct?

[code]
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
int l = threadIdx.z;
[/code]

Sorry if I'm asking basic questions; I'm very new to this concept and afraid of making mistakes.


You are on the good path, just a few remarks.

I am not sure if these declarations are correct:

[code]
dim3 blockD(16,8,16);
dim3 gridD((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );
[/code]

I would rather use:

[code]
dim3 blockD = dim3(16,8,16);
dim3 gridD = dim3((blockD.x -1 + N)/blockD.x , (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z );
[/code]

or just something like dim3 blockD; and then blockD.x = 16; and so on.

See which one works.

For Fermi (it can be different for other architectures) blockD.x*blockD.y*blockD.z <= 1024. In your case blockD.x*blockD.y*blockD.z = 2048.

Next, you are submitting lots of blocks doing duplicate work, because gridD = dim3((blockD.x -1 + N)/blockD.x, (blockD.y -1 + N)/blockD.y, (blockD.z -1 + N)/blockD.z ); but in the kernel you do not use blockIdx.z at all.

So you can in fact use, if your algorithm permits:

[code]
dim3 blockD = dim3(16,8,1);
dim3 gridD = dim3((blockD.x -1 + Nx)/blockD.x, (blockD.y -1 + Ny)/blockD.y, 16); // where Nx*Ny = N
[/code]

and in the kernel

[code]
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
int l = blockIdx.z;
[/code]


OK, let me ask my question this way: suppose I have an image of size 2336 x 1752, and Nfunc = 16 and lmax = 4.


2336 * 1752 * 16 * 4 = 261931008. On a 2.x NVIDIA GPU, the maximum number of threads in a block is 1024; on a 1.x GPU, it's 512. That means the resulting grid size is invalid if you want to represent this with one-dimension. (In C++ AMP, it would be possible, and the whole code including reduction is much easier to understand.)

So, I would represent this with a 3D grid, (i, j) mapped to (x, y) of the grid, and (l, m) mapped to z. The code to spawn the kernel on this space would look like the code shown below. (The following code just illustrates how you would spawn the kernel and how you would extract the indices from the kernel index space.)

[code]
#include <iostream>
#include <vector>

const int max_i = 2336;
const int max_j = 1752;
const int max_m = 16;
const int max_l = 4;

__global__ void kernel(bool * found, int total)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int z = blockIdx.z; // (l, m) are packed into the grid's z dimension
    int l = z % max_l;
    int m = z / max_l;

    // check out of bounds.
    if (i >= max_i)
        return;
    if (j >= max_j)
        return;
    if (l >= max_l)
        return;
    if (m >= max_m)
        return;

    int p = i * max_j * max_m * max_l
          + j * max_m * max_l
          + m * max_l
          + l;
    if (p < total)
        found[p] = true;
}

int main()
{
    int total = max_i * max_j * max_m * max_l;
    bool * found = (bool *) malloc(total);
    std::cout << "total = " << total << "\n";

    bool * d_found = 0;
    cudaError_t e2 = cudaMalloc(&d_found, sizeof(bool) * total);
    if (d_found == 0)
        return 1;
    if (e2 != 0)
        return 1;

    dim3 threads(16, 16);
    int dim_x = (max_i >> 4) + ( (max_i & 15) ? 1 : 0 );
    int dim_y = (max_j >> 4) + ( (max_j & 15) ? 1 : 0 ); // note: max_j here, not max_i
    int dim_z = max_m * max_l;
    dim3 grid(dim_x, dim_y, dim_z);
    kernel<<<grid, threads>>>(d_found, total);
    cudaDeviceSynchronize();

    cudaError_t e = cudaGetLastError();
    if (e)
        std::cout << cudaGetErrorString(e) << "\n";

    void * ptr = (void*) found;
    cudaMemcpy(ptr, d_found, 256, cudaMemcpyDeviceToHost);
    for (int d = 0; d < 256; ++d)
        std::cout << "d = " << d << " " << found[d] << "\n";
    std::cout << "\n";
    return 0;
}
[/code]

Note the bounds checking in the kernel. The grid has to be a multiple of the block size (all that ">>4" code does the rounding up). Consequently, you have to include bounds checks to make sure only the proper threads access the data. There are going to be some unused threads, but that shouldn't be a problem. BTW, the kernel has returns. That isn't exactly OK, especially in C++ AMP and probably OpenCL; they should be rewritten without the "return;". Something to get used to.

Now for the m_coeff[] array... Since each float is 32 bits, you would have to allocate 4 * 261931008 bytes on the GPU, so the cudaMalloc to represent m_coeff over all (i, j, l, m) would probably fail. Consequently, you may have to do "strip mining", i.e., do more work per thread and represent partial sums with a smaller array. In that case, "l" could be mapped to z, and in that kernel, do a for-loop over "m". If that is still too big a space, then you could map "m" to z and do a for-loop over "l". As I said before, there are many ways to do this. (BTW, always check return codes from CUDA runtime calls.)
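One way to sketch the strip-mined flavour ("l" carried by the grid's z dimension, the inner loop over "m" kept in the thread). This is an illustration with assumptions: a float image, a __device__ CalculateSh as in the original code, accumulation directly into the 16-entry coefficient array via float atomicAdd, which needs a compute 2.0+ device, so it would not run as-is on the 9800 GT:

```cuda
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979f
#endif

// Assumed to be defined elsewhere, as in the original code.
__device__ float CalculateSh(int l, int m, float theta, float phi);

__global__ void projectStrip(const float* image, int width, int height, float* coeff)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int l = blockIdx.z;                      // grid z carries l (0 .. lmax-1)
    if (i >= height || j >= width) return;

    float theta  = ((i + 0.5f) / height) * M_PI;
    float phi    = ((j + 0.5f) / width) * 2.0f * M_PI;
    float Le     = image[j + i * width];
    float dOmega = sinf(theta) * (M_PI / height) * (2.0f * M_PI / width);

    // Strip-mine the innermost loop: one thread handles all m for its (i, j, l).
    for (int m = -l; m <= l; ++m)
        atomicAdd(&coeff[l * (l + 1) + m],
                  Le * CalculateSh(l, m, theta, phi) * dOmega);
}
```

The per-thread loop over m means no partial_coef array over all (i, j, l, m) is needed, at the cost of heavy atomic contention; the shared-memory staging discussed earlier would relieve that.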

Ken


[code]
theta = ((float(i) + 0.5f)/float(m_iHeight)) * M_PI;
sinf(theta)
[/code]

would turn into code like this:

[code]
ratio = ((float(i) + 0.5f)/float(m_iHeight));
theta = ratio * M_PI;
sinpif(ratio);
[/code]


[quote]
In general, when a thread issues a series of writes to memory in a particular order, other threads may see the effects of these memory writes in a different order. __threadfence_block(), __threadfence(), and __threadfence_system() can be used to enforce some ordering.

One use case is when threads consume some data produced by other threads as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result. In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum (see Section B.11 about atomic functions). The last block is the one that receives the counter value equal to gridDim.x-1. If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory.
[/quote]

[code]
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}
[/code]

This code is for a reduction such as a sum. Each block computes a partial sum and writes the result to global memory, and the last block to finish performs the final summation.


Well, now I'm stuck with this part; apparently I can't use a 3D grid on a GeForce 9800 GT. This is how I'm defining my grid size:

[code]
int Nx = width, Ny = height, Nz = 16;
dim3 blockD(16, 8, 1);
dim3 gridD((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y, Nz);
[/code]

When I launch my kernel I get this error: invalid configuration argument.

From what I've found, this error means the grid or block size is not correct. Does anyone know about this error?


Unfortunately, your device is very old. You are getting an invalid config because there is no z-dimension in a grid for 1.x devices. In the CUDA Programmer Guide, check out Table A-1 and Table F-2. I'd recommend you set the block dimensions to (4, 4, 16) and the grid to ((blockD.x - 1 + Nx) / blockD.x, (blockD.y - 1 + Ny) / blockD.y). Then, in your kernel, make sure to compute the (i, j, l, m) I showed before. That should work. I have a 9800 also, and that gets past the invalid config problem. --Ken
