Same Code (really, it is) - Very Different Results

Hi All,

Smoke is coming out of my ears; I have been struggling with a problem for over a month now.
I am desperately in need of expert advice, so please help if you can. :)

I am implementing a (very) simple 3D finite difference algorithm in CUDA C on a Tesla C1070.

The code basically loops over all elements of a 3D array and performs some basic arithmetic on each element (5 adds, 1 mul).
This is repeated for about 8000 iterations, where each iteration is computed from the data produced by the previous one, and so on.

The values at a specific point in the array are recorded, and this is the output of the algorithm.

This is implemented once on the GPU and, for reference, once on the CPU. Both versions use double-precision floating point.
(sm_13 is enabled, and I am aware that floating-point results can deviate from exact symbolic arithmetic.)
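
In other words, each interior element is updated from its six neighbours at the previous time step and its own value two steps back, roughly

p[i][j][k] = (1/3)*(pz[i+1][j][k] + pz[i-1][j][k] + pz[i][j+1][k] + pz[i][j-1][k] + pz[i][j][k+1] + pz[i][j][k-1]) - pzz[i][j][k]

where pz holds the previous time level and pzz the one before that (the full code is further down).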

The CPU code works like a charm.
The GPU code, however, produces similar results but with a linearly growing error.
That is, a scalar value is somehow recursively being added to each sample in the output.

Imagine the same results, but with a constantly growing shift in magnitude in the GPU code.

Obviously, looking for an uncleared variable and the like was the first thing I did.
Nothing there, I promise. The codes are identical (I didn't want to make this post too heavy, but I can strip the code down and upload it if needed).

Does anybody have a clue why something like this could be happening?

Any help would be highly appreciated.

Cheers,

Jon.

You probably wrote a[i]=i somewhere instead of a[i]=c[i].

Nope. Searched my code upside down… :)

Any other ideas?

Post your code already.

OK.

I am processing a 3D array slice by slice.

Here’s the kernel:

__global__ void pKernel(double* p, double* pz, double* pzz, double* sourceFn, double* ir, int iDim, int jDim, int kDim, int si, int sj, int sk, int ri, int rj, int rk, int n, int k)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	double wR = 1.0;
	double pCoef = 1.0/3.0;

	// Solve for pressure
	if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
	{
		// We are inside the domain
		p[ijk] = (pCoef*(pz[ip] + pz[im] + pz[jp] + pz[jm] + pz[kp] + pz[km])) - pzz[ijk];
	}

	if (k==0)      { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; } // Down Boundary
	if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary
	if (j==0)      { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; } // Front Boundary
	if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary
	if (i==0)      { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; } // Left Boundary
	if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

	// Inject Source
	if ((i == si) && (j == sj) && (k == sk)) {
		p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];
	}

	// Collect at Receiver
	if ((i == ri) && (j == rj) && (k == rk)) {
		ir[n] = p[recPos];
	}

	// Wait for all
	__syncthreads();
}

with the striding macros defined as:

#define ijk (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define im  (((i-1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define ip  (((i+1) * (jDim) * (kDim)) + ((j) * (kDim)) + (k))
#define jm  (((i) * (jDim) * (kDim)) + ((j-1) * (kDim)) + (k))
#define jp  (((i) * (jDim) * (kDim)) + ((j+1) * (kDim)) + (k))
#define km  (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k-1))
#define kp  (((i) * (jDim) * (kDim)) + ((j) * (kDim)) + (k+1))
#define sourcePos (((si) * (jDim) * (kDim)) + ((sj) * (kDim)) + (sk))
#define recPos    (((ri) * (jDim) * (kDim)) + ((rj) * (kDim)) + (rk))
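
These macros assume a row-major layout with k as the fastest-varying index. As a quick sanity check (a minimal standalone sketch, not part of the simulation code), the flattening can be compared against a plain triple loop:

#include <assert.h>

// Sketch: the flat index i*jDim*kDim + j*kDim + k should enumerate the array in
// the same order as a triple loop with k innermost.
static int flat(int i, int j, int k, int jDim, int kDim)
{
	return (i * jDim * kDim) + (j * kDim) + k;
}

int main(void)
{
	const int iDim = 4, jDim = 5, kDim = 6;   // arbitrary test dimensions
	int expected = 0;
	for (int i = 0; i < iDim; i++)
		for (int j = 0; j < jDim; j++)
			for (int k = 0; k < kDim; k++)
			{
				assert(flat(i, j, k, jDim, kDim) == expected);
				expected++;
			}
	return 0;
}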

And on the host side:

// Loop through time
for (int n=2; n<ns; n++) {

	// Execute Pressure Kernel
	cudaPrintfInit();

	// Loop through k-dimension tiles
	for (int k=0; k<kDim; k++)
	{
		pKernel<<<Dg, Db>>>(pCuda, pzCuda, pzzCuda, sourceFnCuda, irCuda, iDim, jDim, kDim, si, sj, sk, ri, rj, rk, n, k);
		cutilCheckMsg("Kernel execution failed\n");
		cudaPrintfDisplay(stdout, true);
	}

	cudaThreadSynchronize();
	cudaPrintfEnd();

	// Update temporal memory
	cudaMemcpy(pzzCuda, pzCuda, dataSize, cudaMemcpyDeviceToDevice);
	cudaMemcpy(pzCuda, pCuda, dataSize, cudaMemcpyDeviceToDevice);
	cudaMemset((void**)pCuda, 0, dataSize);

	// Status
	printf("GPU Iteration n = %i out of %i \r", n, ns);
} // end n

And this is the reference code for the CPU:

// Run the Simulation on CPU
for (int n=2; n<ns; n++)
{
	for (int i=0; i<iDim; i++)
	{
		for (int j=0; j<jDim; j++)
		{
			for (int k=0; k<kDim; k++)
			{
				// Where are we?
				if ((i>0) && (i<iDim-1) && (j>0) && (j<jDim-1) && (k>0) && (k<kDim-1))
				{
					// We are inside the domain
					p[i][j][k] = (pCoef*(pz[i+1][j][k] + pz[i-1][j][k] + pz[i][j+1][k] + pz[i][j-1][k] + pz[i][j][k+1] + pz[i][j][k-1])) - pzz[i][j][k];
				}

				if (k==0)      { p[i][j][k] = ((1+wR)*pz[i][j][k+1]) - wR*pzz[i][j][k]; } // Down Boundary
				if (k==kDim-1) { p[i][j][k] = ((1+wR)*pz[i][j][k-1]) - wR*pzz[i][j][k]; } // Up Boundary
				if (j==0)      { p[i][j][k] = ((1+wR)*pz[i][j+1][k]) - wR*pzz[i][j][k]; } // Front Boundary
				if (j==jDim-1) { p[i][j][k] = ((1+wR)*pz[i][j-1][k]) - wR*pzz[i][j][k]; } // Back Boundary
				if (i==0)      { p[i][j][k] = ((1+wR)*pz[i+1][j][k]) - wR*pzz[i][j][k]; } // Left Boundary
				if (i==iDim-1) { p[i][j][k] = ((1+wR)*pz[i-1][j][k]) - wR*pzz[i][j][k]; } // Right Boundary
			} // end k
		} // end j
	} // end i

	// Insert Source
	p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

	// Collect at Receiver
	ir[n] = p[ri][rj][rk];

	// Update temporal memory
	for (int i=0; i<iDim; i++)
	{
		for (int j=0; j<jDim; j++)
		{
			for (int k=0; k<kDim; k++)
			{
				pzz[i][j][k] = pz[i][j][k];
				pz[i][j][k] = p[i][j][k];
			}
		}
	}

	// Status
	printf("Iteration n = %i out of %i \r", n, ns);
} // end n

The code does not include the definition of the block size and grid size. Are you sure you don't have threads outside the domain?

I suggest splitting the CUDA kernel into three parts and debugging partial implementations rather than the complete one. For example, keep only

if (k==0)      { p[ijk] = (1.0+wR)*pz[kp] - wR*pzz[ijk]; } // Down Boundary
if (k==kDim-1) { p[ijk] = (1.0+wR)*pz[km] - wR*pzz[ijk]; } // Up Boundary
if (j==0)      { p[ijk] = (1.0+wR)*pz[jp] - wR*pzz[ijk]; } // Front Boundary
if (j==jDim-1) { p[ijk] = (1.0+wR)*pz[jm] - wR*pzz[ijk]; } // Back Boundary
if (i==0)      { p[ijk] = (1.0+wR)*pz[ip] - wR*pzz[ijk]; } // Left Boundary
if (i==iDim-1) { p[ijk] = (1.0+wR)*pz[im] - wR*pzz[ijk]; } // Right Boundary

and compare that to the CPU version.
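
For the comparison, something along these lines could work (a sketch; pHost and pRef are hypothetical host buffers of iDim*jDim*kDim doubles, pRef holding a flattened copy of the CPU result). It copies the GPU result back after a step and reports the largest deviation:

// Sketch: locate where and by how much the GPU result drifts from the CPU reference.
// (needs <math.h> for fabs and <stdio.h> for printf)
cudaMemcpy(pHost, pCuda, dataSize, cudaMemcpyDeviceToHost);
double maxDiff = 0.0;
int maxIdx = -1;
for (int idx = 0; idx < iDim * jDim * kDim; idx++) {
	double d = fabs(pHost[idx] - pRef[idx]);
	if (d > maxDiff) { maxDiff = d; maxIdx = idx; }
}
printf("max |GPU - CPU| = %g at flat index %d\n", maxDiff, maxIdx);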

Thanks for the tips. I will try splitting the code.

Here are the grid/block definitions:

#define blockDimX 16
#define blockDimY 16

dim3 Db(blockDimX, blockDimY, 1);
dim3 Dg(ceil((float)iDim/blockDimX), ceil((float)jDim/blockDimY));

If your problem size is not a multiple of 16, you may spawn additional threads beyond the border (see the ceil((float)iDim/blockDimX) in your grid size), and they may corrupt data.
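
Note that in the posted kernel the interior update is guarded by its if condition, but the boundary branches (e.g. if (k==0)) still run for out-of-range i or j, so such threads can write outside the arrays. The usual fix is an early-exit guard at the top of the kernel; a minimal sketch of the pattern (not the original pKernel):

// Sketch: guard pattern for a 2D launch over an iDim x jDim slice.
// Threads created by the rounded-up grid but lying outside the domain return immediately.
__global__ void guardedKernel(double* p, int iDim, int jDim)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (i >= iDim || j >= jDim) return;   // nothing below runs for stray threads
	p[i * jDim + j] = 0.0;                // placeholder body: writes are now in bounds
}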

I’m testing for redundant threads, but I tried adjusting the problem size to multiples of 16 just in case;
still no luck…

The result (a calculated impulse response) looks like there is some accumulated error.

It fits the profile of numerical instability in an algorithm, but at least from a symbolic point of view the algorithm should be stable (and it is stable on the CPU)…

BTW, thanks for taking the time to help. It's appreciated.

if ((i == si) && (j == sj) && (k == sk)) {
p[sourcePos] = p[sourcePos] + sourceFn[n] + sourceFn[n-2];
}

        // Insert Source
        p[si][sj][sk] = p[si][sj][sk] + sourceFn[n] - sourceFn[n-2];

That doesn't look right: the GPU kernel adds sourceFn[n-2], while the CPU reference subtracts it.
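
Assuming the CPU version implements the intended behavior, the kernel's injection line needs the same sign:

// Source injection with the sign matching the CPU reference:
if ((i == si) && (j == sj) && (k == sk)) {
	p[sourcePos] = p[sourcePos] + sourceFn[n] - sourceFn[n-2];
}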