double loop with inner loop sum using reduction

nickxtsui · March 27, 2012, 6:39pm

hi Folks:

I have successfully implemented a simple reduction example, and thanks for your, especially tera’s, help.

Here I am extending the example to a double loop case, where in the inner loop I sum over all entries of a vector. The outer loop just repeat the process, summing over all entries of the same vector, for Nt times, and then the sum will be returned to ith entry of the output vector. Here is my code:

#define blockSize 128

#include <stdio.h>

__device__ inline void atomicAdd(double *address, double value) 

{

    unsigned long long oldval, newval, readback;

oldval = __double_as_longlong(*address);

    newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) 

        {

        oldval = readback;

        newval = __double_as_longlong(__longlong_as_double(oldval) + value);

    }

}

//template <unsigned int blockSize>

__global__ void myreduce

(       

    double *g_odata, 

    double *g_idata, 

    double *g,

    unsigned int Nt,

    unsigned int Ns

) 

{

        __shared__ volatile double sdata[blockSize];

        unsigned volatile int tid = threadIdx.x;

        unsigned volatile int j = blockIdx.x*(blockSize*2) + tid;

        unsigned volatile int gridSize = blockSize*2*gridDim.x;

        double gt;

        sdata[tid] = 0;		

	

	for (int i = Nt-1; i>=0; i--)

	{

		//g_odata[i] = 0;

		sdata[tid] = 0;

		tid = threadIdx.x;

		j = blockIdx.x*(blockSize*2) + tid;

		gridSize = blockSize*2*gridDim.x;

			

		gt = g[i];

		__syncthreads();

			

		while (j < Ns)

		{

			sdata[tid] += gt*(g_idata[j] + g_idata[j+blockSize]); 

			j += gridSize; 

		}

		__syncthreads();

		

               if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

               if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

               if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }

               if (tid < 32)

               {       

                       if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

                       if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

                       if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

                       if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];

                       if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];

                       if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];

               }

if (tid == 0) atomicAdd(&g_odata[i],sdata[0]);

	}

}

Say, my g_idata = [1 1 1 1 1]; and my w = [1 2 3]; then the outer loop iters for 3 times, with wt being w[1]=1, w[2]=2 and w[3]=3, respectively. Then for the inner loop, it sums w[i]g_idata, namely, it does sum(w[i][1 1 1 1 1]); so, for the first time, the double loop should return : sum(w[1][1 1 1 1 1])=5; for the 2nd time, it should return sum(w[2][1 1 1 1 1])=25=10; and for the 3rd time, it should return sum(w[3][1 1 1 1 1])=3*5=15; So the return of g_odata should be [5 10 15]

OK, here is what is happening: if I comment out

g_odata[i] = 0;

in the for loop, namely if g_odata[i] is not initilaized to 0, then I can get [5 10 15], which is right. But the problem is, when I call the kernel again, it returns [10 20 30], and if again, then [15 30 45]…

So I add the initialization line above to zero g_odata out every time the kernel is called. But the problem is, if I do the initialization like above, every time I got [0 0 0]! as if the atomicAdd() does not work!

So how do I get my right value with g_odata also initialized to 0? Also, is there any other way to put sdata[0] into g_odata without using atomicAdd()? Thanks a lot!

Gilles_C · March 29, 2012, 5:27am

Hi,

Three questions:

[*]What is gridDim?

[*]What is blockDim?

[*]What is “blockSize”?

Without these values, I have the uttermost difficulties to check the code’s validity.

nickxtsui · March 31, 2012, 1:29am

Sure. The blockSize is 128, and the gridDim is 1024; I do not see blockDim though. I assume it should be same as blockSize,128?

Thanks a lot in advance.

I also changed my question some.

Gilles_C · March 31, 2012, 6:23am

Hi,

Your g_odata being a global memory variable, if you initialise it to 0 into your kernel, all blocks will do it in an asynchronous manner. This means that your data might have already been accumulated in the result by a block, whereas another has not made the initialization to zero yet… And guess what, when it does it, it erases all your accumulated data.

TBH, I’m surprised you get [0,0,0] as final result when the initialization is done inside the loop, since you should get some partial accumulated values (unless most of your data to accumulate is actually zeros and thereafter, most of the partial results you get are indeed zeros).

Anyway, what you have to do is to initialize your g_odata array prior to call your kernel with cudaMemset(), like this:

cudaMemset(g_odata, 0, Nt*sizeof(double));

    myreduce<<<...>>>(g_odata, g_idata, g, Nt, Ns);

And BTW, mst of your "volatile"s are useless in the code and counter-productive. You should remove them. Actually, the only one that is useful is the one for declaring the shared memory, and even this one you should avoid, by declaring another pointer to double which would be volatile, and make it point to your shared memory and use it only when tid < 32 (only when you start avoiding __syncthreads).

HTH

pasoleatis · March 31, 2012, 10:27am

Hello,

I am wondering what are you trying to do. I am also using in my programs sum of arrays. Is there something wrong with the example you find in the SDK. Would like to have a code which just does a sum of an array or something more. You can open more topic, but will not help so much.

nickxtsui · April 2, 2012, 7:43pm

Hi,

Your g_odata being a global memory variable, if you initialise it to 0 into your kernel, all blocks will do it in an asynchronous manner. This means that your data might have already been accumulated in the result by a block, whereas another has not made the initialization to zero yet… And guess what, when it does it, it erases all your accumulated data.

TBH, I’m surprised you get [0,0,0] as final result when the initialization is done inside the loop, since you should get some partial accumulated values (unless most of your data to accumulate is actually zeros and thereafter, most of the partial results you get are indeed zeros).

Anyway, what you have to do is to initialize your g_odata array prior to call your kernel with cudaMemset(), like this:
cudaMemset(g_odata, 0, Nt*sizeof(double));

    myreduce<<<...>>>(g_odata, g_idata, g, Nt, Ns);
And BTW, mst of your "volatile"s are useless in the code and counter-productive. You should remove them. Actually, the only one that is useful is the one for declaring the shared memory, and even this one you should avoid, by declaring another pointer to double which would be volatile, and make it point to your shared memory and use it only when tid < 32 (only when you start avoiding __syncthreads).

HTH

Gilles_C:

Thank you so much for your explanation. I agree with you that using cudaMemset() can solve the problem, although I have not tried it out. It is because I am using MATLAB to call the kernel directly, so I guess there is no place to insert cudaMemset() in my code. The MATLAB code to initialize g_odata() to 0 is:

g_odata = gpuArray(zeros(Ns, Nt));

which is, time consuming. That is I wanted to zero out g_data in my kernel.

So, your comments give me this impression: it is not possible to initialize g_odata to 0 every iteration in the kernel before it is valued?

In the kernel, I want to: zero the g_odata first, then do my computation, then send the results to g_odata. Any way to do it?

Also thanks a lot for the volatile comments. Will remove them :)

In the kernel, I want to: zero out g_odata, then do my computation (sum over arrays like you mentioned), then sent results back to g_odata.

Gilles_C · April 3, 2012, 8:48am

Thank you so much for your explanation. I agree with you that using cudaMemset() can solve the problem, although I have not tried it out. It is because I am using MATLAB to call the kernel directly, so I guess there is no place to insert cudaMemset() in my code. The MATLAB code to initialize g_odata() to 0 is:
g_odata = gpuArray(zeros(Ns, Nt));
which is, time consuming. That is I wanted to zero out g_data in my kernel.

I’m a bit puzzled here: I don’t know much (anything) about Matlab and its GPU support, but since AFAICS g_odata is only a vector of dimension Nt, how comes you have to initialize it using both Ns and Nt as parameters? Wouldn’t zeros(1,Nt) or zeros(Nt, 1) (depending on the actual internal storage which I don’t know) be better and most probably about Ns times faster?

No it’s not since you would need a global synchronization inside your kernel, witch is not supported by CUDA.

nickxtsui · April 3, 2012, 10:39pm

Yes, it should be initialized as gpuArray(zeros(Nt,1)) eventually. But right now I am using a 2D array to save g_odata, then sum over the column. You are right, it should be setup to a vector with Nt entries. I guess my description is a little confusing.

I see. It seems I have to initialized outside kernel anyway.