Fail to assign value after initialization to 0

hi Folks:

I have successfully implemented a simple reduction example, and thanks for your, especially tera’s, help.

Here I am extending the example to a double loop case, where in the inner loop I sum over all entries of a vector. The outer loop just repeat the process, summing over all entries of the same vector, for Nt times, and then the sum will be returned to ith entry of the output vector. Here is my code:

#define blockSize 128

#include <stdio.h>

__device__ inline void atomicAdd(double *address, double value) 

{

    unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

    newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) 

        {

        oldval = readback;

        newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    }

}

//template <unsigned int blockSize>

__global__ void myreduce

(       

    double *g_odata, 

    double *g_idata, 

    double *g,

    unsigned int Nt,

    unsigned int Ns

) 

{

        __shared__ volatile double sdata[blockSize];

        unsigned volatile int tid = threadIdx.x;

        unsigned volatile int j = blockIdx.x*(blockSize*2) + tid;

        unsigned volatile int gridSize = blockSize*2*gridDim.x;

        double gt;

        sdata[tid] = 0;         

for (int i = Nt-1; i>=0; i--)

        {

                //g_odata[i] = 0;

                sdata[tid] = 0;

                tid = threadIdx.x;

                j = blockIdx.x*(blockSize*2) + tid;

                gridSize = blockSize*2*gridDim.x;

gt = g[i];

                __syncthreads();

while (j < Ns)

                {

                        sdata[tid] += gt*(g_idata[j] + g_idata[j+blockSize]); 

                        j += gridSize; 

                }

                __syncthreads();

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

               if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

               if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }

               if (tid < 32)

               {       

                       if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

                       if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

                       if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

                       if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];

                       if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];

                       if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];

               }

if (tid == 0) atomicAdd(&g_odata[i],sdata[0]);

        }

}

Basically, from kernel above, I kept getting [0 0 0], which is the initialization, for g_odata instead of something I assigned. Following is detailed description:

Say, my g_idata = [1 1 1 1 1]; and my w = [1 2 3]; then the outer loop iters for 3 times, with wt being w[1]=1, w[2]=2 and w[3]=3, respectively. Then for the inner loop, it sums w[i]g_idata, namely, it does sum(w[i][1 1 1 1 1]); so, for the first time, the double loop should return : sum(w[1][1 1 1 1 1])=5; for the 2nd time, it should return sum(w[2][1 1 1 1 1])=25=10; and for the 3rd time, it should return sum(w[3][1 1 1 1 1])=3*5=15; So the return of g_odata should be [5 10 15]

OK, here is what is happening: if I comment out

g_odata[i] = 0;

in the for loop, namely if g_odata[i] is not initilaized to 0, then I can get [5 10 15], which is right. But the problem is, when I call the kernel again, it returns [10 20 30], and if again, then [15 30 45]…

So I add the initialization line above to zero g_odata out every time the kernel is called. But the problem is, if I do the initialization like above, every time I got [0 0 0]! as if the atomicAdd() does not work!

So how do I get my right value with g_odata also initialized to 0? Also, is there any other way to put sdata[0] into g_odata without using atomicAdd()? Thanks a lot!