I would like to have some clarifications on GPU computing accuracy and determinism.
Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

{
/* Evaluates the thread indices */
int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;
int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */
int id_vett = idx*dimx+idy;

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett, cuFloatComplex *d_Kernel_Matrix,
int dimx, int dimy, int dimu)

{
/* Evaluates the thread indices */
int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;
int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */
int id_vett = idx*dimu+idy;

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.
What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?
Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

I would like to have some clarifications on GPU computing accuracy and determinism.

Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

I would like to have some clarifications on GPU computing accuracy and determinism.
Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

{
/* Evaluates the thread indices */
int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;
int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */
int id_vett = idx*dimx+idy;

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett, cuFloatComplex *d_Kernel_Matrix,
int dimx, int dimy, int dimu)

{
/* Evaluates the thread indices */
int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;
int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */
int id_vett = idx*dimu+idy;

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.
What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?
Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

I would like to have some clarifications on GPU computing accuracy and determinism.

Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian,
thank you very much for your answer.
I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian,
thank you very much for your answer.
I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

[quote name='JFSebastian' post='1121728' date='Sep 24 2010, 10:46 AM']Christian,
thank you very much for your answer.
I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.[/quote]

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

[quote name='JFSebastian' post='1121728' date='Sep 24 2010, 10:46 AM']Christian,
thank you very much for your answer.
I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.[/quote]

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

I'm solving an antenna synthesis problem. so my task is optimizing a functional (defined by a mex-cuda file) by the fminunc matlab routine.
With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.
With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).
With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).[/quote]

I'm solving an antenna synthesis problem. so my task is optimizing a functional (defined by a mex-cuda file) by the fminunc matlab routine.

With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

I'm solving an antenna synthesis problem. so my task is optimizing a functional (defined by a mex-cuda file) by the fminunc matlab routine.
With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.
With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).
With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).[/quote]

I'm solving an antenna synthesis problem. so my task is optimizing a functional (defined by a mex-cuda file) by the fminunc matlab routine.

With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

In the floating point version of the CUDA routine, I was using cosf, sinf, sqrtf, asinf...
In the double precision version, I'm using cos, sin, sqrt, asin, ...
In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.[/quote]

In the floating point version of the CUDA routine, I was using cosf, sinf, sqrtf, asinf...

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.

In the floating point version of the CUDA routine, I was using cosf, sinf, sqrtf, asinf...
In the double precision version, I'm using cos, sin, sqrt, asin, ...
In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.[/quote]

In the floating point version of the CUDA routine, I was using cosf, sinf, sqrtf, asinf...

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.

I suppose it is not gpu specific but just parallel issues. Try to perform this code on cpu in parallel. You may try to run in in device emulation mode if you are using cuda 3.0 and early. Btw, does __fmul_rn improve precsion in your case?

I suppose it is not gpu specific but just parallel issues. Try to perform this code on cpu in parallel. You may try to run in in device emulation mode if you are using cuda 3.0 and early. Btw, does __fmul_rn improve precsion in your case?

Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

Here is the excitation filling kernel.

[codebox]// *******************************

// * KERNEL - EXCITATION FILLING *

// *******************************

__global__ void Excitation_Filling(float *d_X, float *d_Y, float *d_Z, float d_alfa, float d_z0, float *d_x_terzo_vett, float *d_y_terzo_vett, float *d_z_terzo_vett, float *d_r_vett, float *d_theta_vett, float* d_fase_vett, cuFloatComplex *d_campo_feed, int dimx, int dimy, float beta, float mfatt)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimx+idy;

if (idx < dimx && idy < dimy) {

d_x_terzo_vett[id_vett]=-d_X[id_vett];

d_y_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],cos

f(d_alfa)),__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),sinf(d_alfa)));

d_z_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],sin

f(d_alfa)),-__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),cosf(d_alfa)));

d_r_vett[id_vett]=sqrtf(__fmul_rn(d_x_terzo_vett[id_vett],d_

x_terzo_vett[id_vett])+__fmul_rn(d_y_terzo_vett[id_vett],d_y_

terzo_vett[id_vett])+__fmul_rn(d_z_terzo_vett[id_vett],d_z_te

rzo_vett[id_vett]));

d_theta_vett[id_vett]=asinf(__fdividef(sqrtf(__fadd_rn(__fmu

l_rn(d_x_terzo_vett[id_vett],d_x_terzo_vett[id_vett]),

__fmul_rn(d_y_terzo_vett[id_vett],d_y_terzo_vett[id_vett])))

,d_r_vett[id_vett]));

d_campo_feed[id_vett]=make_cuFloatComplex(cosf(__fmul_rn(bet

a,d_r_vett[id_vett])),-sinf(__fmul_rn(beta,d_r_vett[id_vett])));

d_campo_feed[id_vett]=cuCdivf(d_campo_feed[id_vett],make_cuF

loatComplex(d_r_vett[id_vett],0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(powf(cosf(d_theta_vett[id_vett]),mfatt),0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(cosf(d_fase_vett[id_vett]),sinf(d_fase_vett[id_ve

tt])));

}

}

[/codebox]

Here is the Fourier kernel matrix filling kernel

[codebox]// **********************************

// * KERNEL - KERNEL MATRIX FILLING *

// **********************************

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett, cuFloatComplex *d_Kernel_Matrix,

int dimx, int dimy, int dimu)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimu+idy;

if (idx < dimx*dimy && idy < dimu) {

d_Kernel_Matrix[id_vett]=make_cuFloatComplex(cosf(__fadd_rn(

__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d_v_vett[idy],d_Y

[idx]))),

sinf(__fadd_rn(__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d

_v_vett[idy],d_Y[idx]))));

}

}

[/codebox]

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

Here is the excitation filling kernel.

[codebox]// *******************************

// * KERNEL - EXCITATION FILLING *

// *******************************

__global__ void Excitation_Filling(float *d_X, float *d_Y, float *d_Z, float d_alfa, float d_z0, float *d_x_terzo_vett, float *d_y_terzo_vett, float *d_z_terzo_vett, float *d_r_vett, float *d_theta_vett, float* d_fase_vett, cuFloatComplex *d_campo_feed, int dimx, int dimy, float beta, float mfatt)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimx+idy;

if (idx < dimx && idy < dimy) {

d_x_terzo_vett[id_vett]=-d_X[id_vett];

d_y_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],cos

f(d_alfa)),__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),sinf(d_alfa)));

d_z_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],sin

f(d_alfa)),-__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),cosf(d_alfa)));

d_r_vett[id_vett]=sqrtf(__fmul_rn(d_x_terzo_vett[id_vett],d_

x_terzo_vett[id_vett])+__fmul_rn(d_y_terzo_vett[id_vett],d_y_

terzo_vett[id_vett])+__fmul_rn(d_z_terzo_vett[id_vett],d_z_te

rzo_vett[id_vett]));

d_theta_vett[id_vett]=asinf(__fdividef(sqrtf(__fadd_rn(__fmu

l_rn(d_x_terzo_vett[id_vett],d_x_terzo_vett[id_vett]),

__fmul_rn(d_y_terzo_vett[id_vett],d_y_terzo_vett[id_vett])))

,d_r_vett[id_vett]));

d_campo_feed[id_vett]=make_cuFloatComplex(cosf(__fmul_rn(bet

a,d_r_vett[id_vett])),-sinf(__fmul_rn(beta,d_r_vett[id_vett])));

d_campo_feed[id_vett]=cuCdivf(d_campo_feed[id_vett],make_cuF

loatComplex(d_r_vett[id_vett],0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(powf(cosf(d_theta_vett[id_vett]),mfatt),0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(cosf(d_fase_vett[id_vett]),sinf(d_fase_vett[id_ve

tt])));

}

}

[/codebox]

Here is the Fourier kernel matrix filling kernel

[codebox]// **********************************

// * KERNEL - KERNEL MATRIX FILLING *

// **********************************

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett, cuFloatComplex *d_Kernel_Matrix,

int dimx, int dimy, int dimu)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimu+idy;

if (idx < dimx*dimy && idy < dimu) {

d_Kernel_Matrix[id_vett]=make_cuFloatComplex(cosf(__fadd_rn(

__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d_v_vett[idy],d_Y

[idx]))),

sinf(__fadd_rn(__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d

_v_vett[idy],d_Y[idx]))));

}

}

[/codebox]

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

wwww.orangeowlsolutions.com

Basically, I'm trying to implement a 1-D Fourier transform by a cublasCgemm routine call and to compile it as a Matlab mex function. So, I have two kernels: one to calculate the vector to be transformed and one to calculate the Fourier kernel matrix. Then, I have the mex function calling the two kernels and the cublasCgemm function.

Here is the excitation filling kernel.

[codebox]// *******************************

// * KERNEL - EXCITATION FILLING *

// *******************************

__global__ void Excitation_Filling(float *d_X, float *d_Y, float *d_Z, float d_alfa, float d_z0, float *d_x_terzo_vett, float *d_y_terzo_vett, float *d_z_terzo_vett, float *d_r_vett, float *d_theta_vett, float* d_fase_vett, cuFloatComplex *d_campo_feed, int dimx, int dimy, float beta, float mfatt)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimx+idy;

if (idx < dimx && idy < dimy) {

d_x_terzo_vett[id_vett]=-d_X[id_vett];

d_y_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],cos

f(d_alfa)),__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),sinf(d_alfa)));

d_z_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],sin

f(d_alfa)),-__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),cosf(d_alfa)));

d_r_vett[id_vett]=sqrtf(__fmul_rn(d_x_terzo_vett[id_vett],d_

x_terzo_vett[id_vett])+__fmul_rn(d_y_terzo_vett[id_vett],d_y_

terzo_vett[id_vett])+__fmul_rn(d_z_terzo_vett[id_vett],d_z_te

rzo_vett[id_vett]));

d_theta_vett[id_vett]=asinf(__fdividef(sqrtf(__fadd_rn(__fmu

l_rn(d_x_terzo_vett[id_vett],d_x_terzo_vett[id_vett]),

__fmul_rn(d_y_terzo_vett[id_vett],d_y_terzo_vett[id_vett])))

,d_r_vett[id_vett]));

d_campo_feed[id_vett]=make_cuFloatComplex(cosf(__fmul_rn(bet

a,d_r_vett[id_vett])),-sinf(__fmul_rn(beta,d_r_vett[id_vett])));

d_campo_feed[id_vett]=cuCdivf(d_campo_feed[id_vett],make_cuF

loatComplex(d_r_vett[id_vett],0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(powf(cosf(d_theta_vett[id_vett]),mfatt),0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(cosf(d_fase_vett[id_vett]),sinf(d_fase_vett[id_ve

tt])));

}

}

[/codebox]

Here is the Fourier kernel matrix filling kernel

[codebox]// **********************************

// * KERNEL - KERNEL MATRIX FILLING *

// **********************************

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett, cuFloatComplex *d_Kernel_Matrix,

int dimx, int dimy, int dimu)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimu+idy;

if (idx < dimx*dimy && idy < dimu) {

d_Kernel_Matrix[id_vett]=make_cuFloatComplex(cosf(__fadd_rn(

__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d_v_vett[idy],d_Y

[idx]))),

sinf(__fadd_rn(__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d

_v_vett[idy],d_Y[idx]))));

}

}

[/codebox]

Since the beginning, I had accuracy issues, which were mitigated by use of __fadd_rn and __fmul_rn. However, the final accuracy I get is 1e-3 or 1e-4, which is not enough for my application. I'm using single precision since I have a GT 230M card.

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

Strangely, the result changes when I change blocksize and dimgrid, the faster the algorithm runs, the poorer the result is. I have separately tested the cublasCgemm routine call and checked it returns with a satisfactory accuracy.

Here is the excitation filling kernel.

[codebox]// *******************************

// * KERNEL - EXCITATION FILLING *

// *******************************

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimx+idy;

if (idx < dimx && idy < dimy) {

d_x_terzo_vett[id_vett]=-d_X[id_vett];

d_y_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],cos

f(d_alfa)),__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),sinf(d_alfa)));

d_z_terzo_vett[id_vett]=__fadd_rn(__fmul_rn(d_Y[id_vett],sin

f(d_alfa)),-__fmul_rn(__fadd_rn(d_Z[id_vett],-(d_z0)),cosf(d_alfa)));

d_r_vett[id_vett]=sqrtf(__fmul_rn(d_x_terzo_vett[id_vett],d_

x_terzo_vett[id_vett])+__fmul_rn(d_y_terzo_vett[id_vett],d_y_

terzo_vett[id_vett])+__fmul_rn(d_z_terzo_vett[id_vett],d_z_te

rzo_vett[id_vett]));

d_theta_vett[id_vett]=asinf(__fdividef(sqrtf(__fadd_rn(__fmu

l_rn(d_x_terzo_vett[id_vett],d_x_terzo_vett[id_vett]),

__fmul_rn(d_y_terzo_vett[id_vett],d_y_terzo_vett[id_vett])))

,d_r_vett[id_vett]));

d_campo_feed[id_vett]=make_cuFloatComplex(cosf(__fmul_rn(bet

a,d_r_vett[id_vett])),-sinf(__fmul_rn(beta,d_r_vett[id_vett])));

d_campo_feed[id_vett]=cuCdivf(d_campo_feed[id_vett],make_cuF

loatComplex(d_r_vett[id_vett],0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(powf(cosf(d_theta_vett[id_vett]),mfatt),0.0f));

d_campo_feed[id_vett]=cuCmulf(d_campo_feed[id_vett],make_cuF

loatComplex(cosf(d_fase_vett[id_vett]),sinf(d_fase_vett[id_ve

tt])));

}

}

[/codebox]

Here is the Fourier kernel matrix filling kernel

[codebox]// **********************************

// * KERNEL - KERNEL MATRIX FILLING *

// **********************************

int dimx, int dimy, int dimu)

{

/* Evaluates the thread indices */

int idx = __umul24(blockIdx.x,BLOCK_SIZE_X) + threadIdx.x;

int idy = __umul24(blockIdx.y,BLOCK_SIZE_Y) + threadIdx.y;

/* Evaluates the matrix filling index */

int id_vett = idx*dimu+idy;

if (idx < dimx*dimy && idy < dimu) {

d_Kernel_Matrix[id_vett]=make_cuFloatComplex(cosf(__fadd_rn(

__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d_v_vett[idy],d_Y

[idx]))),

sinf(__fadd_rn(__fmul_rn(d_u_vett[idy],d_X[idx]),__fmul_rn(d

_v_vett[idy],d_Y[idx]))));

}

}

[/codebox]

What could be the source of such large errors ? Use of cos/sin/sqrt functions ? Synchronization ?

wwww.orangeowlsolutions.com

Christian

Christian

Christian

Christian

thank you very much for your answer.

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian[/quote]

thank you very much for your answer.

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian

wwww.orangeowlsolutions.com

thank you very much for your answer.

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.

[quote name='cbuchner1' post='1120521' date='Sep 21 2010, 04:31 PM']Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian[/quote]

thank you very much for your answer.

Christian

wwww.orangeowlsolutions.com

thank you very much for your answer.

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.[/quote]

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

thank you very much for your answer.

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

thank you very much for your answer.

I just turned the code to double precision and run it on a GTX 260 card. The accuracy is now much better, although still poorer as compared to a double precision sequential code.[/quote]

Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).

thank you very much for your answer.

Well how would you know that the sequential solution is more precise?

With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).[/quote]

With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

wwww.orangeowlsolutions.com

With a sequentially (matlab) implemented functional, I get a certain functional value following the optimization, say -33dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double precision arithmetics, and when running the code on a GTX 260 card (which is supposed to deal with NON-compliant IEEE double precision), I get a functional value of -31dB, which is much closer to the sequential result, but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE compliant double precision), although much faster, performed as -30dB, with poorer result as compared to the GTX 260 case.

[quote name='cbuchner1' post='1121765' date='Sep 24 2010, 01:05 PM']Well how would you know that the sequential solution is more precise?

You can only know if you compare this to a reference solution that was computed with arbitrary precision (Mathematica offers that feature).[/quote]

wwww.orangeowlsolutions.com

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.[/quote]

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.

wwww.orangeowlsolutions.com

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.[/quote]

In the double precision version, I'm using cos, sin, sqrt, asin, ...

In either cases, I'm not compiling using fast math.

[quote name='Lev' post='1122061' date='Sep 25 2010, 02:50 AM']Are you using sinf? It is float function and if you use switch fast math is uses approximate function. Description of functions tells presision.

wwww.orangeowlsolutions.com