I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

__global__
void afc_example
(
double *d1,*d2;
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
)
{ /* j: outside loop; i: inside loop */
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i;
double art1,art2;
double phs;
double zr, zi;

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.

I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.

From your code I understand that you have a problem of NtxNs with two arrays of g[Nt] and g[Ns] which do not depend on d1 and d2 and you are trying to get d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)+afi[j]*sin(x[j]*g[i]/2.0) and d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)-afi[j]*sin(x[j]*g[i]/2.0)

First) My guess is that you want to do a Fourier transform of an array af=complex(afr,afi). The cufft library is very good at that. If I got it wrong, please put the math formula you are trying to calculate.

If this does not satisfy you then, next. This has a double loop anf maybe you can rewrite the problem a matrix vector multiplication in which d=M*af, where d=complex(d1,d2), af=complex(afr,afi) and M[i][j]=complex(cos(x[j]*g[i]/2.0),sin(x[j]*g[i]/2.0)).

Now if none of this apply then here is the advice. In practice you need to perform NtxNs operations. You can submit NtxNs threads which would compute just this:

[code]

phs = x[j]*g[i];
zr = cos(phs/2.0);
zi = sin(phs/2.0);
art1 = afr[j]*zr + afl[j]*zi;
art2 = afr[j]*zi - afl[j]*zr;
[/code]
Now the problem would be to do the summation over j. So you define 2 shared array:

__shared__ double tmpd1[blocksize];
__shared__ double tmpd2[blocksize];
You use those to store the intermediate results;

so that you have now:

tmpd1[threadIdx.x] = art1;
tmpd2[threadIdx.x] = art2;
no you syncronize the threads in a block with :

_syncthreads();

Now that you have the values you need to sum them up on each block, I use a reduction algorithm

Firs let me see if I understood your problem correctly. From this code:

_global__ void afc_example( double *d1,*d2;double *afr, double *afl, const double *x, const double *g, const int Ns, const int Nt )

// .....

while(j < Ns)

{ /* initialize */

for(i = Nt-1;i >= 0;i--)

{

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;

} /* i */

j += blockDim.x*gridDim.x;

} /* j */

}

From your code I understand that you have a problem of NtxNs with two arrays of g[Nt] and g[Ns] which do not depend on d1 and d2 and you are trying to get d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)+afi[j]*sin(x[j]*g[i]/2.0) and d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)-afi[j]*sin(x[j]*g[i]/2.0)

First) My guess is that you want to do a Fourier transform of an array af=complex(afr,afi). The cufft library is very good at that. If I got it wrong, please put the math formula you are trying to calculate.

If this does not satisfy you then, next. This has a double loop anf maybe you can rewrite the problem a matrix vector multiplication in which d=M*af, where d=complex(d1,d2), af=complex(afr,afi) and M[i][j]=complex(cos(x[j]*g[i]/2.0),sin(x[j]*g[i]/2.0)).

Now if none of this apply then here is the advice. In practice you need to perform NtxNs operations. You can submit NtxNs threads which would compute just this:

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

Now the problem would be to do the summation over j. So you define 2 shared array:

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

You use those to store the intermediate results;

so that you have now:

tmpd1[threadIdx.x] = art1;

tmpd2[threadIdx.x] = art2;

no you syncronize the threads in a block with :

_syncthreads();

Now that you have the values you need to sum them up on each block, I use a reduction algorithm

__syncthreads();

for(int offset=bsl;offset>=1;offset=offset/2) //here I assume blockDim.x=2*bsl

{

if(tid<offset)

{

tempene[tid]=tempene[tid]+tempene[tid+offset]

;

}

__syncthreads();

}

if(tid==0)

{

// the resut is stored in tempene[0] and you can use atomic add to complete the sum to the

}

}

}

}

The saving from last you can do it with atomic line. Now you submit NtxNs threads grouped in blocks with bsize threads per block.

hi pasoleatis:
Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

What is the blocksize? Or do I just put "blocksize" there?

[quote name='nickxtsui' date='20 March 2012 - 07:53 PM' timestamp='1332273229' post='1385486']
hi pasoleatis:
Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

What is the blocksize? Or do I just put "blocksize" there?
[/quote]
blocksize is the numebr of threads per block

You calculation look to me a lot as a 1D transform FFT, or if you have enough memory you could put he problem as a matrix -vector multiplication.

But if those do not work, my idea is the following. You have NsxNt (or NtxNs) operations and then summation so that you are left with Ns numbers. In m algorithm you ubmit NsxNt threads which perform the operations. After that each block will make a partial summation, so in the end we are left with NsxNt/blocksize numebr which have to be reduced by summation to Ns numbers. so you submit NsxNt threads which get all the numbers, then the partial summation is made on each block, then you use atomic add to add the partial value to the d1[i] and d2[i].

This is not he smartest possible algorithm but it should work. In principle instead of atomic add something smarter can be done to reduce from NsxNt/blocksize to Ns numbers.

[quote name='nickxtsui' date='20 March 2012 - 07:53 PM' timestamp='1332273229' post='1385486']

hi pasoleatis:

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

What is the blocksize? Or do I just put "blocksize" there?

blocksize is the numebr of threads per block

You calculation look to me a lot as a 1D transform FFT, or if you have enough memory you could put he problem as a matrix -vector multiplication.

But if those do not work, my idea is the following. You have NsxNt (or NtxNs) operations and then summation so that you are left with Ns numbers. In m algorithm you ubmit NsxNt threads which perform the operations. After that each block will make a partial summation, so in the end we are left with NsxNt/blocksize numebr which have to be reduced by summation to Ns numbers. so you submit NsxNt threads which get all the numbers, then the partial summation is made on each block, then you use atomic add to add the partial value to the d1[i] and d2[i].

This is not he smartest possible algorithm but it should work. In principle instead of atomic add something smarter can be done to reduce from NsxNt/blocksize to Ns numbers.

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.
For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

__shared__ double tmpd1[Ns*Nt];

__shared__ double tmpd2[Ns*Nt];

I dont this is going to work. So what value is exactly for blocksize?

[quote name='nickxtsui' date='20 March 2012 - 09:06 PM' timestamp='1332277617' post='1385509']
Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.
For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

[quote name='nickxtsui' date='20 March 2012 - 09:06 PM' timestamp='1332277617' post='1385509']

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

__shared__ double tmpd1[Ns*Nt];

__shared__ double tmpd2[Ns*Nt];

I dont this is going to work. So what value is exactly for blocksize?

Hello,

You have to do the initialization in a different function.

blocksize=blockDim.x;

tid=threadIdx;

The reduction is just adding the numbers in pairs.

[quote name='nickxtsui' date='21 March 2012 - 03:09 PM' timestamp='1332342585' post='1385799']
OK. So if I have a column vector with length of 38271, what value my blocksize should be?
[/quote]
This depends on how much memory you have and how fast you code runs with different blocksizes. You can use if you have the cc 2.0 up to 1024. You can make a call for each i and have a very small memory usage. or if you have enough memory you can make a call for all i. this will results in a mtrix of Nsx32. If Ns is below 4096 it should fir in 3 or 6 GB.

[quote name='nickxtsui' date='21 March 2012 - 03:09 PM' timestamp='1332342585' post='1385799']

OK. So if I have a column vector with length of 38271, what value my blocksize should be?

This depends on how much memory you have and how fast you code runs with different blocksizes. You can use if you have the cc 2.0 up to 1024. You can make a call for each i and have a very small memory usage. or if you have enough memory you can make a call for all i. this will results in a mtrix of Nsx32. If Ns is below 4096 it should fir in 3 or 6 GB.

I am doing an example and right now I have successfully tested summing a vector that has 16 elements. Code is like this:
[code]
#include<iostream>
#include<stdio.h>
#define BLOCKSIZE 16

__device__ inline void atomicAdd(double *address, double value) {
unsigned long long oldval, newval, readback;

What I really want is to extend this program to use more than one block, with each block having more threads, regardless how much memory I have. I have no idea how to code it. I define the block size to be 16, and right now I think I only have 1 block.
If I sum over a long vector, it will use more blocks? and the length of the vector does not have to be the power of 2. That is something I don't know how to solve.

What I really want is to extend this program to use more than one block, with each block having more threads, regardless how much memory I have. I have no idea how to code it. I define the block size to be 16, and right now I think I only have 1 block.

If I sum over a long vector, it will use more blocks? and the length of the vector does not have to be the power of 2. That is something I don't know how to solve.

The size of a warp is 32. With 16 threads per block you can not even fill one warp. I recommend something between 128 and 1024.

Your code is good except of atomicAdd(&g_output[idx], s_data[0] ); you should have anther index idxout=(blockIdx.x * blockDim.x + threadIdx.x)/BLOCKSIZE;

and then atomicAdd(&g_output[idxout], s_data[0] );

This way all the blocks with index blockIdx.x between 0 and BLOCKSIZE-1 will compute the partial sums for i=0, then all blocks with blockIdx.x betwen BLOCKSIZE and 2*BLOCKSIZE-1 will compute the value i=1 and so on.

In main code your submission should be
agpuhelloworld<<<Ns*Nt/BLOCKSIZE>>>(g_output, g_input);

I assume g_output of size Ns and g_input of size Nt.

This code will work but it will be slow for large Nt;

The size of a warp is 32. With 16 threads per block you can not even fill one warp. I recommend something between 128 and 1024.

Your code is good except of atomicAdd(&g_output[idx], s_data[0] ); you should have anther index idxout=(blockIdx.x * blockDim.x + threadIdx.x)/BLOCKSIZE;

and then atomicAdd(&g_output[idxout], s_data[0] );

This way all the blocks with index blockIdx.x between 0 and BLOCKSIZE-1 will compute the partial sums for i=0, then all blocks with blockIdx.x betwen BLOCKSIZE and 2*BLOCKSIZE-1 will compute the value i=1 and so on.

Well. Try to compile the code. See if it works. You defined BLOCKSIZE, but you did not put the line which you use to call the function, the one one in the main function. From this line if

Well. Try to compile the code. See if it works. You defined BLOCKSIZE, but you did not put the line which you use to call the function, the one one in the main function. From this line if

(threadIdx.x <78654)

s_data[threadIdx.x] = g_input[idx];

__syncthreads( );

I assume that you expect for a block to run with 78654 threads, that will not happen with any of the current nvidia gpu.

a = feval(gpuadd, a,b, N);
out = gather(a);
[/code]

The input, b, to MATLAB function, which is equivalent to the input to kernel, g_idata, is created by MATLAB code:
[code]
b = ones(78765,1)';
[/code]
which is basically a column vector with 78765 entries. And all the entries are 1.

So I figure if the whole things goes right , then what comes out should be the sum of the vector, 78765 of 1s, 1+1+1...+1 = 78765; But I got wierd result : 3.1913*10^31
although the .cu file can be compiled and the kernel can be created without any problem. Where I messed it up? Any pointers?
Thanks a lot.

The input, b, to MATLAB function, which is equivalent to the input to kernel, g_idata, is created by MATLAB code:

b = ones(78765,1)';

which is basically a column vector with 78765 entries. And all the entries are 1.

So I figure if the whole things goes right , then what comes out should be the sum of the vector, 78765 of 1s, 1+1+1...+1 = 78765; But I got wierd result : 3.1913*10^31

although the .cu file can be compiled and the kernel can be created without any problem. Where I messed it up? Any pointers?

I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

__global__

void afc_example

(

double *d1,*d2;

double *afr,

double *afl,

const double *x,

const double *g,

const int Ns,

const int Nt

)

{ /* j: outside loop; i: inside loop */

int j = blockIdx.x*blockDim.x + threadIdx.x;

int i;

double art1,art2;

double phs;

double zr, zi;

while(j < Ns)

{ /* initialize */

for(i = Nt-1;i >= 0;i--)

{

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;

} /* i */

j += blockDim.x*gridDim.x;

} /* j */

}

And here is my atomicAdd():

[code]

__device__ inline void atomicAdd(double *address, double value) {

unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

oldval = readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

[/code]

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.

I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

__global__

void afc_example

(

double *d1,*d2;

double *afr,

double *afl,

const double *x,

const double *g,

const int Ns,

const int Nt

)

{ /* j: outside loop; i: inside loop */

int j = blockIdx.x*blockDim.x + threadIdx.x;

int i;

double art1,art2;

double phs;

double zr, zi;

while(j < Ns)

{ /* initialize */

for(i = Nt-1;i >= 0;i--)

{

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;

} /* i */

j += blockDim.x*gridDim.x;

} /* j */

}

And here is my atomicAdd():

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.

Firs let me see if I understood your problem correctly. From this code:

[code]

_global__ void afc_example( double *d1,*d2;double *afr, double *afl, const double *x, const double *g, const int Ns, const int Nt )

// .....

while(j < Ns)

{ /* initialize */

for(i = Nt-1;i >= 0;i--)

{

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;

} /* i */

j += blockDim.x*gridDim.x;

} /* j */

}

[/code]

From your code I understand that you have a problem of NtxNs with two arrays of g[Nt] and g[Ns] which do not depend on d1 and d2 and you are trying to get d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)+afi[j]*sin(x[j]*g[i]/2.0) and d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)-afi[j]*sin(x[j]*g[i]/2.0)

First) My guess is that you want to do a Fourier transform of an array af=complex(afr,afi). The cufft library is very good at that. If I got it wrong, please put the math formula you are trying to calculate.

If this does not satisfy you then, next. This has a double loop anf maybe you can rewrite the problem a matrix vector multiplication in which d=M*af, where d=complex(d1,d2), af=complex(afr,afi) and M[i][j]=complex(cos(x[j]*g[i]/2.0),sin(x[j]*g[i]/2.0)).

Now if none of this apply then here is the advice. In practice you need to perform NtxNs operations. You can submit NtxNs threads which would compute just this:

[code]

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

[/code]

Now the problem would be to do the summation over j. So you define 2 shared array:

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

You use those to store the intermediate results;

so that you have now:

tmpd1[threadIdx.x] = art1;

tmpd2[threadIdx.x] = art2;

no you syncronize the threads in a block with :

_syncthreads();

Now that you have the values you need to sum them up on each block, I use a reduction algorithm

[code]

__syncthreads();

for(int offset=bsl;offset>=1;offset=offset/2) //here I assume blockDim.x=2*bsl

{

if(tid<offset)

{

tempene[tid]=tempene[tid]+tempene[tid+offset]

;

}

__syncthreads();

}

if(tid==0)

{

// the resut is stored in tempene[0] and you can use atomic add to complete the sum to the

}

}

}

}

[/code]

The saving from last you can do it with atomic line. Now you submit NtxNs threads grouped in blocks with bsize threads per block.

Firs let me see if I understood your problem correctly. From this code:

From your code I understand that you have a problem of NtxNs with two arrays of g[Nt] and g[Ns] which do not depend on d1 and d2 and you are trying to get d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)+afi[j]*sin(x[j]*g[i]/2.0) and d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)-afi[j]*sin(x[j]*g[i]/2.0)

First) My guess is that you want to do a Fourier transform of an array af=complex(afr,afi). The cufft library is very good at that. If I got it wrong, please put the math formula you are trying to calculate.

If this does not satisfy you then, next. This has a double loop anf maybe you can rewrite the problem a matrix vector multiplication in which d=M*af, where d=complex(d1,d2), af=complex(afr,afi) and M[i][j]=complex(cos(x[j]*g[i]/2.0),sin(x[j]*g[i]/2.0)).

Now if none of this apply then here is the advice. In practice you need to perform NtxNs operations. You can submit NtxNs threads which would compute just this:

Now the problem would be to do the summation over j. So you define 2 shared array:

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

You use those to store the intermediate results;

so that you have now:

tmpd1[threadIdx.x] = art1;

tmpd2[threadIdx.x] = art2;

no you syncronize the threads in a block with :

_syncthreads();

Now that you have the values you need to sum them up on each block, I use a reduction algorithm

The saving from last you can do it with atomic line. Now you submit NtxNs threads grouped in blocks with bsize threads per block.

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

[code]

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

[/code]

What is the blocksize? Or do I just put "blocksize" there?

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

What is the blocksize? Or do I just put "blocksize" there?

hi pasoleatis:

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

[code]

__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];

[/code]

What is the blocksize? Or do I just put "blocksize" there?

[/quote]

blocksize is the numebr of threads per block

You calculation look to me a lot as a 1D transform FFT, or if you have enough memory you could put he problem as a matrix -vector multiplication.

But if those do not work, my idea is the following. You have NsxNt (or NtxNs) operations and then summation so that you are left with Ns numbers. In m algorithm you ubmit NsxNt threads which perform the operations. After that each block will make a partial summation, so in the end we are left with NsxNt/blocksize numebr which have to be reduced by summation to Ns numbers. so you submit NsxNt threads which get all the numbers, then the partial summation is made on each block, then you use atomic add to add the partial value to the d1[i] and d2[i].

This is not he smartest possible algorithm but it should work. In principle instead of atomic add something smarter can be done to reduce from NsxNt/blocksize to Ns numbers.

hi pasoleatis:

Thank you so much for carefully reviewing my questions.

Thank you so much.

BTW: when you declare

What is the blocksize? Or do I just put "blocksize" there?

blocksize is the numebr of threads per block

You calculation look to me a lot as a 1D transform FFT, or if you have enough memory you could put he problem as a matrix -vector multiplication.

But if those do not work, my idea is the following. You have NsxNt (or NtxNs) operations and then summation so that you are left with Ns numbers. In m algorithm you ubmit NsxNt threads which perform the operations. After that each block will make a partial summation, so in the end we are left with NsxNt/blocksize numebr which have to be reduced by summation to Ns numbers. so you submit NsxNt threads which get all the numbers, then the partial summation is made on each block, then you use atomic add to add the partial value to the d1[i] and d2[i].

This is not he smartest possible algorithm but it should work. In principle instead of atomic add something smarter can be done to reduce from NsxNt/blocksize to Ns numbers.

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

[code]

__shared__ double tmpd1[Ns*Nt];

__shared__ double tmpd2[Ns*Nt];

[/code]

I dont this is going to work. So what value is exactly for blocksize?

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

I dont this is going to work. So what value is exactly for blocksize?

[code]

if(tid<offset)

[/code]

I am really confused why [b] if(tid<offset)[/b] would give me so much trouble......

I assume

[code]

tid = threadIdx.x;

[/code]

and

[code]

offset = blockDim.x/2;

[/code]

I am really confused why

if(tid<offset)would give me so much trouble......I assume

and

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

[code]

__shared__ double tmpd1[Ns*Nt];

__shared__ double tmpd2[Ns*Nt];

[/code]

I dont this is going to work. So what value is exactly for blocksize?

[/quote]

Hello,

You have to do the initialization in a different function.

blocksize=blockDim.x;

tid=threadIdx;

The reduction is just adding the numbers in pairs.

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.

I dont this is going to work. So what value is exactly for blocksize?

Hello,

You have to do the initialization in a different function.

blocksize=blockDim.x;

tid=threadIdx;

The reduction is just adding the numbers in pairs.

OK. So if I have a column vector with length of 38271, what value my blocksize should be?

[/quote]

This depends on how much memory you have and how fast you code runs with different blocksizes. You can use if you have the cc 2.0 up to 1024. You can make a call for each i and have a very small memory usage. or if you have enough memory you can make a call for all i. this will results in a mtrix of Nsx32. If Ns is below 4096 it should fir in 3 or 6 GB.

OK. So if I have a column vector with length of 38271, what value my blocksize should be?

This depends on how much memory you have and how fast you code runs with different blocksizes. You can use if you have the cc 2.0 up to 1024. You can make a call for each i and have a very small memory usage. or if you have enough memory you can make a call for all i. this will results in a mtrix of Nsx32. If Ns is below 4096 it should fir in 3 or 6 GB.

[code]

#include<iostream>

#include<stdio.h>

#define BLOCKSIZE 16

__device__ inline void atomicAdd(double *address, double value) {

unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

oldval = readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

__global__ void agpuhelloworld(double* g_output, double* g_input)

{

__shared__ double s_data[BLOCKSIZE];

unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (threadIdx.x <16)

s_data[threadIdx.x] = g_input[idx];

__syncthreads( );

// compute sum for the threadblock

for ( int dist = BLOCKSIZE/2; dist > 0; dist /= 2 )

{

if ( threadIdx.x < dist )

s_data[threadIdx.x] += s_data[ threadIdx.x + dist ];

__syncthreads( );

}

printf("delete successful");

if ( threadIdx.x == 0 )

{

atomicAdd(&g_output[idx], s_data[0] );

}

}

[/code]

What I really want is to extend this program to use more than one block, with each block having more threads, regardless how much memory I have. I have no idea how to code it. I define the block size to be 16, and right now I think I only have 1 block.

If I sum over a long vector, it will use more blocks? and the length of the vector does not have to be the power of 2. That is something I don't know how to solve.

What I really want is to extend this program to use more than one block, with each block having more threads, regardless how much memory I have. I have no idea how to code it. I define the block size to be 16, and right now I think I only have 1 block.

If I sum over a long vector, it will use more blocks? and the length of the vector does not have to be the power of 2. That is something I don't know how to solve.

Your code is good except of atomicAdd(&g_output[idx], s_data[0] ); you should have anther index idxout=(blockIdx.x * blockDim.x + threadIdx.x)/BLOCKSIZE;

and then atomicAdd(&g_output[idxout], s_data[0] );

This way all the blocks with index blockIdx.x between 0 and BLOCKSIZE-1 will compute the partial sums for i=0, then all blocks with blockIdx.x betwen BLOCKSIZE and 2*BLOCKSIZE-1 will compute the value i=1 and so on.

In main code your submission should be

agpuhelloworld<<<Ns*Nt/BLOCKSIZE>>>(g_output, g_input);

I assume g_output of size Ns and g_input of size Nt.

This code will work but it will be slow for large Nt;

Your code is good except of atomicAdd(&g_output[idx], s_data[0] ); you should have anther index idxout=(blockIdx.x * blockDim.x + threadIdx.x)/BLOCKSIZE;

and then atomicAdd(&g_output[idxout], s_data[0] );

This way all the blocks with index blockIdx.x between 0 and BLOCKSIZE-1 will compute the partial sums for i=0, then all blocks with blockIdx.x betwen BLOCKSIZE and 2*BLOCKSIZE-1 will compute the value i=1 and so on.

In main code your submission should be

agpuhelloworld<<<Ns*Nt/BLOCKSIZE>>>(g_output, g_input);

I assume g_output of size Ns and g_input of size Nt.

This code will work but it will be slow for large Nt;

[code]

#include<iostream>

#include<stdio.h>

#define BLOCKSIZE 78654

__device__ inline void atomicAdd(double *address, double value) {

unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

oldval = readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

__global__ void agpuhelloworld(double* g_output, double* g_input)

{

__shared__ double s_data[BLOCKSIZE];

unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (threadIdx.x <78654)

s_data[threadIdx.x] = g_input[idx];

__syncthreads( );

// compute sum for the threadblock

for ( int dist = BLOCKSIZE/2; dist > 0; dist /= 2 )

{

if ( threadIdx.x < dist )

s_data[threadIdx.x] += s_data[ threadIdx.x + dist ];

__syncthreads( );

}

printf("delete successful");

if ( threadIdx.x == 0 )

{

atomicAdd(&g_output[idx], s_data[0] );

}

}

[/code]

(threadIdx.x <78654)

s_data[threadIdx.x] = g_input[idx];

__syncthreads( );

I assume that you expect for a block to run with 78654 threads, that will not happen with any of the current nvidia gpu.

(threadIdx.x <78654)

s_data[threadIdx.x] = g_input[idx];

__syncthreads( );

I assume that you expect for a block to run with 78654 threads, that will not happen with any of the current nvidia gpu.

[code]

#define blockSize 256

#include <stdio.h>

__device__ inline void atomicAdd(double *address, double value) {

unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

oldval = readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

//template <unsigned int blockSize>

__global__ void myreduce(double *g_odata, double *g_idata, unsigned int n)

{

__shared__ double sdata[blockSize];

unsigned int tid = threadIdx.x;

unsigned int i = blockIdx.x*(blockSize*2) + tid;

unsigned int gridSize = blockSize*2*gridDim.x;

sdata[tid] = 0;

while (i < n)

{

sdata[tid] += g_idata[i] + g_idata[i+blockSize];

i += gridSize;

}

__syncthreads();

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

if (tid < 32)

{

if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

if (blockSize >= 8) sdata[tid] += sdata[tid + 4];

if (blockSize >= 4) sdata[tid] += sdata[tid + 2];

if (blockSize >= 2) sdata[tid] += sdata[tid + 1];

}

//if (tid == 0) g_odata[blockIdx.x] = sdata[0];

if (tid == 0) atomicAdd(&g_odata[0], sdata[0]);

}

[/code]

The function that calls this kernel is a [b]MATLAB [/b]script:

[code]

[size="4"]function out= amyreduce_cuda(b)

persistent gpuadd;

persistent a;

N = length(b);

if isempty(gpuadd)

disp 'Initializing GPU drf calculation...';

gpuadd = parallel.gpu.CUDAKernel('myreduce.ptx','myreduce.cu','myreduce');

tmp = gpuDevice;

gpuadd.GridSize = [min(1024,tmp.MaxGridSize(1)) 1]; % min(1024, maxblocks)

gpuadd.ThreadBlockSize = [min(128,tmp.MaxThreadsPerBlock) 1 1]; % min(128, maxthreadsperblock)

a = gpuArray(zeros(5,1));

end;

a = feval(gpuadd, a,b, N);

out = gather(a);

[/code]

The input, b, to MATLAB function, which is equivalent to the input to kernel, g_idata, is created by MATLAB code:

[code]

b = ones(78765,1)';

[/code]

which is basically a column vector with 78765 entries. And all the entries are 1.

So I figure if the whole things goes right , then what comes out should be the sum of the vector, 78765 of 1s, 1+1+1...+1 = 78765; But I got wierd result : 3.1913*10^31

although the .cu file can be compiled and the kernel can be created without any problem. Where I messed it up? Any pointers?

Thanks a lot.

The function that calls this kernel is a

MATLABscript:The input, b, to MATLAB function, which is equivalent to the input to kernel, g_idata, is created by MATLAB code:

which is basically a column vector with 78765 entries. And all the entries are 1.

So I figure if the whole things goes right , then what comes out should be the sum of the vector, 78765 of 1s, 1+1+1...+1 = 78765; But I got wierd result : 3.1913*10^31

although the .cu file can be compiled and the kernel can be created without any problem. Where I messed it up? Any pointers?

Thanks a lot.