Simply:
I have been trying this for days, and I am thankful to those who helped. However, I have not yet figured it out. I am trying to implement a double loop with the outer loop over j and the inner loop over i. For each i, I want the values for all j to be summed and stored at index i. However, I cannot get this kernel right. I think the most reasonable program structure for what I want to do is:
__global__
void afc_example
(
double *d1,
double *d2,
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
)
{ /* j: outside loop; i: inside loop */
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i;
double art1,art2;
double phs;
double zr, zi;
while(j < Ns)
{ /* initialize */
for(i = Nt-1;i >= 0;i--)
{
phs = x[j]*g[i];
zr = cos(phs/2.0);
zi = sin(phs/2.0);
art1 = afr[j]*zr + afl[j]*zi;
art2 = afr[j]*zi - afl[j]*zr;
d1[i] += art1;
d2[i] += art2;
} /* i */
j += blockDim.x*gridDim.x;
} /* j */
}
And here is my atomicAdd():
__device__ inline void atomicAdd(double *address, double value) {
unsigned long long oldval, newval, readback;
oldval = __double_as_longlong(*address);
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {
oldval = readback;
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
}
}
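A minimal sketch of how the two accumulation lines in the kernel might call a CAS-based add like the one above, so that threads working on different j no longer race on the same d1[i] and d2[i]. Several things here are assumptions and not part of the original code: the helper is renamed atomicAddDouble because devices of compute capability 6.0 and later already provide a built-in atomicAdd(double*, double), which a user-defined function with the same signature would clash with; afc_atomic, to_device and run_afc_atomic are illustrative names; error checking is omitted and Ns > 0 is assumed.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical name, chosen to avoid clashing with the built-in double atomicAdd.
__device__ inline double atomicAddDouble(double *address, double value)
{
    unsigned long long *addr = reinterpret_cast<unsigned long long *>(address);
    unsigned long long oldval = *addr, assumed;
    do {
        assumed = oldval;
        oldval = atomicCAS(addr, assumed,
                           __double_as_longlong(__longlong_as_double(assumed) + value));
    } while (assumed != oldval);  // retry if another thread changed *address first
    return __longlong_as_double(oldval);
}

__global__ void afc_atomic(double *d1, double *d2,
                           const double *afr, const double *afl,
                           const double *x, const double *g,
                           const int Ns, const int Nt)
{
    // Grid-stride loop over j: each thread handles a disjoint set of j values.
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < Ns;
         j += blockDim.x * gridDim.x) {
        for (int i = Nt - 1; i >= 0; i--) {
            double phs = x[j] * g[i];
            double zr = cos(phs / 2.0);
            double zi = sin(phs / 2.0);
            // Threads with different j update the same d1[i] and d2[i],
            // so the read-modify-write has to be atomic.
            atomicAddDouble(&d1[i], afr[j] * zr + afl[j] * zi);
            atomicAddDouble(&d2[i], afr[j] * zi - afl[j] * zr);
        }
    }
}

// Illustrative helper: allocate a device array and copy a host vector into it.
static double *to_device(const std::vector<double> &h)
{
    double *d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(double));
    cudaMemcpy(d, h.data(), h.size() * sizeof(double), cudaMemcpyHostToDevice);
    return d;
}

// Illustrative host wrapper: zeroes d1/d2 on the device, launches, copies back.
void run_afc_atomic(const std::vector<double> &afr, const std::vector<double> &afl,
                    const std::vector<double> &x, const std::vector<double> &g,
                    std::vector<double> &d1, std::vector<double> &d2)
{
    const int Ns = (int)x.size(), Nt = (int)g.size();
    assert(afr.size() == x.size() && afl.size() == x.size());
    double *dafr = to_device(afr), *dafl = to_device(afl);
    double *dx = to_device(x), *dg = to_device(g);
    double *dd1 = nullptr, *dd2 = nullptr;
    cudaMalloc(&dd1, Nt * sizeof(double));
    cudaMalloc(&dd2, Nt * sizeof(double));
    cudaMemset(dd1, 0, Nt * sizeof(double));  // all-zero bytes represent 0.0
    cudaMemset(dd2, 0, Nt * sizeof(double));
    afc_atomic<<<(Ns + 255) / 256, 256>>>(dd1, dd2, dafr, dafl, dx, dg, Ns, Nt);
    d1.assign(Nt, 0.0);
    d2.assign(Nt, 0.0);
    // A blocking cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(d1.data(), dd1, Nt * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(d2.data(), dd2, Nt * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dafr); cudaFree(dafl); cudaFree(dx); cudaFree(dg);
    cudaFree(dd1); cudaFree(dd2);
}
```

Because the order in which threads reach each atomic add is not fixed, the floating-point sums can differ in the last bits from run to run, so results are best compared against a CPU loop with a small tolerance.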
It is basically a double loop. I chose to parallelize over j because some variables are updated on each iteration of i, so the ith iteration depends on the previous, (i-1)th, iteration. j is well suited to parallel computing because the computation for j involves only the jth location; along the j dimension, processing is independent. It is as if there is a 2D matrix of size Ns X Nt that I want to fill in. I can fill in (process) all entries of a column at the same time, which is equivalent to processing all j simultaneously for a fixed i. Then I move to the next i and do the same thing, processing all j again at the same time, and so on until all i have been processed, that is, until Nt is reached.
After getting this Ns X Nt matrix, I add along the j dimension, which is along the column direction, from top to bottom. By doing so, I finally have a 1 X Nt vector, which is d[i] (d1 and d2 in the kernel above).
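The fill-then-sum picture above can also be sketched literally as two kernels, with no atomics at all: the first stores every (j, i) term, the second lets one thread per i add up its Ns terms. This is only an illustration under assumptions: the names afc_fill, afc_sum, to_device and run_afc_two_pass are made up, the matrix is stored so that the Ns entries for one i are contiguous, error checking is omitted, and Ns > 0 and Nt > 0 are assumed.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Pass 1: threads are spread over j; each writes its own matrix entries.
__global__ void afc_fill(double *m1, double *m2,
                         const double *afr, const double *afl,
                         const double *x, const double *g, int Ns, int Nt)
{
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < Ns;
         j += blockDim.x * gridDim.x) {
        for (int i = Nt - 1; i >= 0; i--) {
            double phs = x[j] * g[i];
            double zr = cos(phs / 2.0), zi = sin(phs / 2.0);
            m1[(size_t)i * Ns + j] = afr[j] * zr + afl[j] * zi;
            m2[(size_t)i * Ns + j] = afr[j] * zi - afl[j] * zr;
        }
    }
}

// Pass 2: threads are spread over i; each sums over all j for its i,
// so no two threads ever write the same output element.
__global__ void afc_sum(double *d1, double *d2,
                        const double *m1, const double *m2, int Ns, int Nt)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < Nt;
         i += blockDim.x * gridDim.x) {
        double s1 = 0.0, s2 = 0.0;
        for (int j = 0; j < Ns; j++) {
            s1 += m1[(size_t)i * Ns + j];
            s2 += m2[(size_t)i * Ns + j];
        }
        d1[i] = s1;
        d2[i] = s2;
    }
}

// Illustrative helper: allocate a device array and copy a host vector into it.
static double *to_device(const std::vector<double> &h)
{
    double *d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(double));
    cudaMemcpy(d, h.data(), h.size() * sizeof(double), cudaMemcpyHostToDevice);
    return d;
}

// Illustrative host wrapper running both passes on the default stream.
void run_afc_two_pass(const std::vector<double> &afr, const std::vector<double> &afl,
                      const std::vector<double> &x, const std::vector<double> &g,
                      std::vector<double> &d1, std::vector<double> &d2)
{
    const int Ns = (int)x.size(), Nt = (int)g.size();
    assert(afr.size() == x.size() && afl.size() == x.size());
    double *dafr = to_device(afr), *dafl = to_device(afl);
    double *dx = to_device(x), *dg = to_device(g);
    double *m1 = nullptr, *m2 = nullptr, *dd1 = nullptr, *dd2 = nullptr;
    cudaMalloc(&m1, (size_t)Ns * Nt * sizeof(double));
    cudaMalloc(&m2, (size_t)Ns * Nt * sizeof(double));
    cudaMalloc(&dd1, Nt * sizeof(double));
    cudaMalloc(&dd2, Nt * sizeof(double));
    afc_fill<<<(Ns + 255) / 256, 256>>>(m1, m2, dafr, dafl, dx, dg, Ns, Nt);
    // Kernels on the same stream run in order, so pass 2 sees pass 1's writes.
    afc_sum<<<(Nt + 255) / 256, 256>>>(dd1, dd2, m1, m2, Ns, Nt);
    d1.assign(Nt, 0.0);
    d2.assign(Nt, 0.0);
    cudaMemcpy(d1.data(), dd1, Nt * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(d2.data(), dd2, Nt * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dafr); cudaFree(dafl); cudaFree(dx); cudaFree(dg);
    cudaFree(m1); cudaFree(m2); cudaFree(dd1); cudaFree(dd2);
}
```

The trade-off is memory: this version needs two temporary arrays of Ns*Nt doubles, which may be too much for large problems, whereas an atomic accumulation needs no temporary storage.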
I think I still need to use atomicAdd(); I just have not figured out how yet.
I would greatly appreciate any thoughts or pointers. Thanks a lot.