Same Code (really, it is) - Much Different Results
Hi All,

Smoke is coming out of my ears; I have been struggling with a problem for over a month now.
I am desperately in need of expert advice; please help if you can. :)

I am implementing a (very) simple 3D finite difference algorithm using CUDA C on a Tesla C1070.

The code loops over all elements of a 3D array and performs some basic arithmetic on each element (5 additions, 1 multiplication).
This happens iteratively, about 8000 times, where each iteration is computed from the data produced by the previous one, and so on.

The values at a specific point in the array are recorded at each iteration; this is the output of the algorithm.

This is implemented once on the GPU and, for reference, once on the CPU. Both versions use double-precision floating point.
(The code is compiled for sm_13, and I am aware that floating-point calculations can deviate from symbolic arithmetic.)

The CPU code works like a charm.
The GPU code, however, produces similar results but with a linearly growing error.
That is, a scalar value is somehow being added recursively to each successive sample of the output.

Imagine the same results, but with a constantly growing shift in magnitude in the GPU code.

Obviously, the first thing I did was look for a variable that is not being cleared, etc.
Nothing there, I promise. The two codes are identical (I didn't want to make this post too heavy, but I can strip the code down and upload it if needed).

Does anybody have a clue why something like this could be happening?

Any help would be highly appreciated.

Cheers,

Jon.

#1
Posted 09/28/2010 09:09 PM   
You probably wrote a[i] = i somewhere instead of a[i] = c[i].

#3
Posted 09/29/2010 03:11 PM   
[quote name='Lev' post='1124060' date='Sep 29 2010, 04:11 PM']You probably write somewhere a[i]=i instead of a[i]=c[i][/quote]

Nope. I've searched my code up and down... :)

Any other ideas?

#5
Posted 09/29/2010 05:42 PM   
Post your code at last.

#7
Posted 09/29/2010 05:55 PM   
[quote name='Lev' post='1124139' date='Sep 29 2010, 06:55 PM']post your code at last[/quote]

OK.

I am processing a 3D array slice by slice.
Here's the kernel:

[code]__global__ void pKernel(double* p, double* pz, double* pzz, double* sourceFn, double* ir, int iDim, int jDim, int kDim, int si, int sj, int sk, int ri, int rj, int rk, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    double wR = 1.0;
    double pCoef = 1.0/3.0;

    // Solve for pressure
    if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
    {
        // We are inside the domain
        p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];
    }

    if (k==0)      { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; } // Down Boundary
    if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary
    if (j==0)      { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; } // Front Boundary
    if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary
    if (i==0)      { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; } // Left Boundary
    if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

    // Inject Source
    if ((i == si) && (j == sj) && (k == sk)) {
        p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];
    }

    // Collect at Receiver
    if ((i == ri) && (j == rj) && (k == rk)) {
        ir[n] = p[recPos];
    }

    // Wait for all
    __syncthreads();
}[/code]

with the striding macros defined as:

[code]#define ijk (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define im (((i-1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define ip (((i+1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define jm (((i) * (jDim) * (kDim)) + ((j-1) * (kDim)) + (k))
#define jp (((i) * (jDim) * (kDim)) + ((j+1) * (kDim)) + (k))
#define km (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k-1))
#define kp (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k+1))
#define sourcePos (((si) * (jDim) * (kDim)) + ((sj) * (kDim)) + (sk))
#define recPos (((ri) * (jDim) * (kDim)) + ((rj) * (kDim)) + (rk))[/code]
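
(As a quick sanity check on the flat indexing, here is a worked example; the dimensions below are arbitrary, not my actual ones:)

[code]// Worked example (sketch only, arbitrary dimensions): with jDim = 5 and kDim = 7,
// element (i,j,k) = (2,3,4) maps to the flat index
//   ijk = 2*5*7 + 3*7 + 4 = 95,
// which is the same logical element the CPU code addresses as p[2][3][4].[/code]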

And on the host side:

[code]// Loop through time
for (int n=2; n<ns; n++) {

    // Execute Pressure Kernel
    cudaPrintfInit();

    // Loop through k-dimension tiles
    for (int k=0; k<kDim; k++)
    {
        pKernel<<<Dg, Db>>>(pCuda, pzCuda, pzzCuda, sourceFnCuda, irCuda, iDim, jDim, kDim, si, sj, sk, ri, rj, rk, n, k);

        cutilCheckMsg("Kernel execution failed\n");
        cudaPrintfDisplay(stdout, true);
    }

    cudaThreadSynchronize();
    cudaPrintfEnd();

    // Update temporal memory
    cudaMemcpy(pzzCuda, pzCuda, dataSize, cudaMemcpyDeviceToDevice);
    cudaMemcpy(pzCuda, pCuda, dataSize, cudaMemcpyDeviceToDevice);
    cudaMemset(pCuda, 0, dataSize);

    // Status
    printf("GPU Iteration n = %i out of %i \r", n, ns);

} // end n[/code]
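
(As an aside, the two device-to-device copies plus the memset could also be replaced by rotating the three device pointers; this is only a sketch, assuming the three buffers are interchangeable and were all allocated with the same dataSize:)

[code]// Sketch only -- not what I currently run: rotate the device buffers
// instead of copying them every time step.
double* tmp = pzzCuda;   // oldest buffer will be reused as the new "current" target
pzzCuda = pzCuda;        // previous step becomes two-steps-ago
pzCuda  = pCuda;         // current step becomes previous step
pCuda   = tmp;
cudaMemset(pCuda, 0, dataSize);   // clear the buffer that will be written next[/code]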


And this is the reference code for the CPU:

[code]// Run the Simulation on CPU
for (int n=2; n<ns; n++)
{
    for (int i=0; i<iDim; i++)
    {
        for (int j=0; j<jDim; j++)
        {
            for (int k=0; k<kDim; k++)
            {
                // Where are we?
                if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
                {
                    // We are inside the domain
                    p[i][j][k] = (pCoef*(pz[i+1][j][k] + pz[i-1][j][k] + pz[i][j+1][k] + pz[i][j-1][k] + pz[i][j][k+1] + pz[i][j][k-1])) - pzz[i][j][k];
                }

                if (k==0)      { p[i][j][k] = ((1+wR)*pz[i][j][k+1]) - wR*pzz[i][j][k]; } // Down Boundary
                if (k==kDim-1) { p[i][j][k] = ((1+wR)*pz[i][j][k-1]) - wR*pzz[i][j][k]; } // Up Boundary
                if (j==0)      { p[i][j][k] = ((1+wR)*pz[i][j+1][k]) - wR*pzz[i][j][k]; } // Front Boundary
                if (j==jDim-1) { p[i][j][k] = ((1+wR)*pz[i][j-1][k]) - wR*pzz[i][j][k]; } // Back Boundary
                if (i==0)      { p[i][j][k] = ((1+wR)*pz[i+1][j][k]) - wR*pzz[i][j][k]; } // Left Boundary
                if (i==iDim-1) { p[i][j][k] = ((1+wR)*pz[i-1][j][k]) - wR*pzz[i][j][k]; } // Right Boundary

            } // end k
        } // end j
    } // end i

    // Insert Source
    p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

    // Collect at Receiver
    ir[n] = p[ri][rj][rk];

    // Update temporal memory
    for (int i=0; i<iDim; i++)
    {
        for (int j=0; j<jDim; j++)
        {
            for (int k=0; k<kDim; k++)
            {
                pzz[i][j][k] = pz[i][j][k];
                pz[i][j][k] = p[i][j][k];
            }
        }
    }

    // Status
    printf("Iteration n = %i out of %i \r", n, ns);

} // end n[/code]

#9
Posted 09/29/2010 06:17 PM   
The code does not include the definition of the block size and grid size. Are you sure you do not have threads outside the domain?

#11
Posted 09/29/2010 06:32 PM   
I suggest splitting the CUDA kernel into parts and debugging a partial implementation rather than the complete one. For example, keep only:

if (k==0) { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; } // Down Boundary

if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary

if (j==0) { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; } // Front Boundary

if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary

if (i==0) { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; } // Left Boundary

if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

and compare it to the CPU version.
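
For instance, a minimal comparison sketch (assuming the GPU result is copied back into a flat host buffer pHost, and pCpu holds the CPU reference flattened with the same ijk macro; both names are just placeholders):

[code]// Sketch only: report the largest element-wise difference after one time step.
// pHost and pCpu are placeholder names for the flattened GPU and CPU results.
// (fabs comes from <math.h>, printf from <stdio.h>.)
double maxDiff = 0.0;
for (int idx = 0; idx < iDim * jDim * kDim; idx++) {
    double d = fabs(pHost[idx] - pCpu[idx]);
    if (d > maxDiff) maxDiff = d;
}
printf("max abs difference: %e\n", maxDiff);[/code]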

#13
Posted 09/29/2010 06:38 PM   
Thanks for the tips. I will try splitting the code.

Here are the grid/block definitions:

[code]#define blockDimX 16
#define blockDimY 16

dim3 Db(blockDimX, blockDimY, 1);
dim3 Dg(ceil((float)iDim/blockDimX), ceil((float)jDim/blockDimY));[/code]
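
Since iDim and jDim are not necessarily multiples of 16, the grid can contain padding threads. A minimal guard sketch for those threads (not in my current kernel) would be something like:

[code]// Sketch only -- a guard for threads that fall outside the i/j domain when
// iDim or jDim is not a multiple of blockDimX/blockDimY. Without a check like
// this, branches that test only k (e.g. k==0) can write outside the arrays.
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

if (i >= iDim || j >= jDim) return;
// (If __syncthreads() is kept at the end of the kernel, wrap the kernel body
// in this condition instead of returning early.)[/code]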

#15
Posted 09/29/2010 07:13 PM   