Accuracy in GPU floating point calculations

I would like some clarification on GPU computing accuracy and determinism.

Basically, I’m trying to implement a 1-D Fourier transform via a cublasCgemm call and to compile it as a Matlab mex function. So I have two kernels: one that computes the vector to be transformed and one that computes the Fourier kernel matrix. The mex function then calls the two kernels and the cublasCgemm routine (a host-side sketch of that call is given after the two kernel listings below).

Here is the excitation filling kernel.

[codebox]// *******************************
// * KERNEL - EXCITATION FILLING *
// *******************************

__global__ void Excitation_Filling(float *d_X, float *d_Y, float *d_Z, float d_alfa, float d_z0,
                                   float *d_x_terzo_vett, float *d_y_terzo_vett, float *d_z_terzo_vett,
                                   float *d_r_vett, float *d_theta_vett, float *d_fase_vett,
                                   cuFloatComplex *d_campo_feed, int dimx, int dimy, float beta, float mfatt)
{
    /* Evaluates the thread indices */
    int idx = __umul24(blockIdx.x, BLOCK_SIZE_X) + threadIdx.x;
    int idy = __umul24(blockIdx.y, BLOCK_SIZE_Y) + threadIdx.y;

    /* Evaluates the matrix filling index */
    int id_vett = idx*dimx + idy;

    if (idx < dimx && idy < dimy) {

        /* Transformed coordinates */
        d_x_terzo_vett[id_vett] = -d_X[id_vett];
        d_y_terzo_vett[id_vett] = __fadd_rn(__fmul_rn(d_Y[id_vett], cosf(d_alfa)),
                                            __fmul_rn(__fadd_rn(d_Z[id_vett], -(d_z0)), sinf(d_alfa)));
        d_z_terzo_vett[id_vett] = __fadd_rn(__fmul_rn(d_Y[id_vett], sinf(d_alfa)),
                                           -__fmul_rn(__fadd_rn(d_Z[id_vett], -(d_z0)), cosf(d_alfa)));

        /* Radial distance and theta angle */
        d_r_vett[id_vett] = sqrtf(__fmul_rn(d_x_terzo_vett[id_vett], d_x_terzo_vett[id_vett]) +
                                  __fmul_rn(d_y_terzo_vett[id_vett], d_y_terzo_vett[id_vett]) +
                                  __fmul_rn(d_z_terzo_vett[id_vett], d_z_terzo_vett[id_vett]));
        d_theta_vett[id_vett] = asinf(__fdividef(sqrtf(__fadd_rn(__fmul_rn(d_x_terzo_vett[id_vett], d_x_terzo_vett[id_vett]),
                                                                 __fmul_rn(d_y_terzo_vett[id_vett], d_y_terzo_vett[id_vett]))),
                                                 d_r_vett[id_vett]));

        /* Complex field: exp(-j*beta*r)/r, cos^mfatt(theta) taper and phase factor */
        d_campo_feed[id_vett] = make_cuFloatComplex( cosf(__fmul_rn(beta, d_r_vett[id_vett])),
                                                    -sinf(__fmul_rn(beta, d_r_vett[id_vett])));
        d_campo_feed[id_vett] = cuCdivf(d_campo_feed[id_vett], make_cuFloatComplex(d_r_vett[id_vett], 0.0f));
        d_campo_feed[id_vett] = cuCmulf(d_campo_feed[id_vett], make_cuFloatComplex(powf(cosf(d_theta_vett[id_vett]), mfatt), 0.0f));
        d_campo_feed[id_vett] = cuCmulf(d_campo_feed[id_vett], make_cuFloatComplex(cosf(d_fase_vett[id_vett]), sinf(d_fase_vett[id_vett])));
    }
}

[/codebox]

Here is the Fourier kernel matrix filling kernel.

[codebox]// **********************************
// * KERNEL - KERNEL MATRIX FILLING *
// **********************************

__global__ void Kernel_Matrix_Filling(float *d_X, float *d_Y, float *d_Z, float *d_u_vett, float *d_v_vett,
                                      cuFloatComplex *d_Kernel_Matrix, int dimx, int dimy, int dimu)
{
    /* Evaluates the thread indices */
    int idx = __umul24(blockIdx.x, BLOCK_SIZE_X) + threadIdx.x;
    int idy = __umul24(blockIdx.y, BLOCK_SIZE_Y) + threadIdx.y;

    /* Evaluates the matrix filling index */
    int id_vett = idx*dimu + idy;

    if (idx < dimx*dimy && idy < dimu) {
        d_Kernel_Matrix[id_vett] = make_cuFloatComplex(cosf(__fadd_rn(__fmul_rn(d_u_vett[idy], d_X[idx]),
                                                                      __fmul_rn(d_v_vett[idy], d_Y[idx]))),
                                                       sinf(__fadd_rn(__fmul_rn(d_u_vett[idy], d_X[idx]),
                                                                      __fmul_rn(d_v_vett[idy], d_Y[idx]))));
    }
}

[/codebox]
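
For completeness, the host-side step mentioned above — feeding the two kernel outputs to cublasCgemm — would look roughly like the sketch below. This is a minimal sketch using the legacy CUBLAS API; the dimensions M (number of spectral points) and N (number of samples), the column-major layout and the output array d_spettro are assumptions, not taken from the original code.

[codebox]#include <cublas.h>
#include <cuComplex.h>

/* Hypothetical host-side wrapper: computes the 1-D transform as y = K * x,
   treating d_Kernel_Matrix as an M x N column-major matrix and d_campo_feed
   as an N x 1 vector. Assumes cublasInit() has already been called. */
void dft_via_cgemm(const cuFloatComplex *d_Kernel_Matrix, /* M x N matrix     */
                   const cuFloatComplex *d_campo_feed,    /* N x 1 vector     */
                   cuFloatComplex       *d_spettro,       /* M x 1 result     */
                   int M, int N)
{
    cuFloatComplex one  = make_cuFloatComplex(1.0f, 0.0f);
    cuFloatComplex zero = make_cuFloatComplex(0.0f, 0.0f);

    /* (M x N) * (N x 1) matrix-vector product via the legacy CUBLAS API */
    cublasCgemm('N', 'N', M, 1, N,
                one,  d_Kernel_Matrix, M,
                      d_campo_feed,    N,
                zero, d_spettro,       M);
}[/codebox]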

Since the beginning I have had accuracy issues, which were mitigated by using __fadd_rn and __fmul_rn. However, the final accuracy I get is of the order of 1e-3 or 1e-4, which is not enough for my application. I’m using single precision since I have a GT 230M card.

What could be the source of such large errors? The use of cos/sin/sqrt functions? Synchronization?

Strangely, the result changes when I change the block size and grid dimensions: the faster the algorithm runs, the poorer the result. I have tested the cublasCgemm call separately and verified that it returns with satisfactory accuracy.

Implicit rounding to zero for denormals comes to my mind (only on Compute 1.x hardware). Is there any part in your computation where you might get extremely small operands (even just intermediate results)?

Christian
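
To illustrate the flush-to-zero behaviour mentioned here: on Compute 1.x hardware, a single-precision result that falls in the denormal range is replaced by zero, and that zero then propagates through later operations. A minimal, self-contained example with hypothetical values, unrelated to the code above:

[codebox]#include <cstdio>
#include <cuda_runtime.h>

/* A product that lands in the denormal range is flushed to zero on
   Compute 1.x, and the error then propagates through later arithmetic. */
__global__ void ftz_demo(float *out)
{
    float a = 1.0e-20f;
    float p = __fmul_rn(a, a);  /* ~1e-40 is a denormal: flushed to 0.0f on sm_1x */
    out[0] = p;                 /* 0.0f with flush-to-zero, ~1e-40 otherwise      */
    out[1] = p / a;             /* 0.0f instead of ~1e-20 once the flush happened */
}

int main()
{
    float h[2];
    float *d;
    cudaMalloc((void**)&d, 2 * sizeof(float));
    ftz_demo<<<1, 1>>>(d);
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("product = %g, propagated = %g\n", h[0], h[1]);
    cudaFree(d);
    return 0;
}[/codebox]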

Christian,

thank you very much for your answer.

I just converted the code to double precision and ran it on a GTX 260 card. The accuracy is now much better, although still poorer than that of a double-precision sequential code.

Well, how would you know that the sequential solution is more precise?

You can only know by comparing both to a reference solution computed with arbitrary precision (Mathematica offers that feature).
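
One way to make such a comparison concrete is to measure both results against the same higher-precision reference and report the maximum relative error. A minimal host-side sketch; the function and array names are placeholders:

[codebox]#include <math.h>

/* Maximum relative error of a single-precision result against a
   higher-precision reference computed elsewhere. */
double max_relative_error(const float *result, const double *reference, int n)
{
    double max_err = 0.0;
    for (int i = 0; i < n; ++i) {
        double denom = fabs(reference[i]) > 0.0 ? fabs(reference[i]) : 1.0;
        double err   = fabs((double)result[i] - reference[i]) / denom;
        if (err > max_err) max_err = err;
    }
    return max_err;
}[/codebox]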

I’m solving an antenna synthesis problem, so my task is to optimize a functional (defined by a mex-cuda file) with the fminunc Matlab routine.

With a sequentially (Matlab) implemented functional, the optimization reaches a certain functional value, say -33 dB, and the synthesis result is satisfactory.

With a CUDA implementation of the functional in single precision, I get a functional value of -24 dB, which is poorer than the sequential case (the result is meaningless).

With the same CUDA implementation of the functional, but with double-precision arithmetic, running the code on a GTX 260 card (which is supposed to provide NON-IEEE-compliant double precision), I get a functional value of -31 dB, which is much closer to the sequential result but still not fully satisfactory. Surprisingly, the same code run on a Tesla card (which is supposed to handle IEEE-compliant double precision), although much faster, gave -30 dB, a poorer result than in the GTX 260 case.

Are you using sinf? It is a float function, and if you use the fast-math switch it is replaced by an approximate function. The function documentation gives the precision.
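
For reference, this is the substitution that -use_fast_math performs: sinf is replaced by the __sinf intrinsic, which is faster but less accurate, especially for arguments of large magnitude. A minimal comparison, with an arbitrary test argument:

[codebox]#include <cstdio>
#include <cuda_runtime.h>

/* Compares the accurate single-precision sinf with the fast __sinf
   intrinsic that -use_fast_math substitutes for it. */
__global__ void sin_compare(float x, float *out)
{
    out[0] = sinf(x);    /* accurate single-precision version            */
    out[1] = __sinf(x);  /* hardware intrinsic: faster but less accurate */
}

int main()
{
    float h[2];
    float *d;
    cudaMalloc((void**)&d, 2 * sizeof(float));
    sin_compare<<<1, 1>>>(100.0f, d);  /* arbitrary test argument */
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sinf = %.8f  __sinf = %.8f  diff = %g\n", h[0], h[1], (double)(h[0] - h[1]));
    cudaFree(d);
    return 0;
}[/codebox]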

In the floating-point version of the CUDA routine, I was using cosf, sinf, sqrtf, asinf, …

In the double-precision version, I’m using cos, sin, sqrt, asin, …

In either case, I’m not compiling with fast math.

I suppose it is not GPU-specific but just a parallelism issue. Try running this code on the CPU in parallel. You could also run it in device emulation mode if you are using CUDA 3.0 or earlier. By the way, does __fmul_rn improve precision in your case?

By the way, you may want to check it with different matrix multiplication implementations from different libraries: CULA, MAGMA, PLASMA. You could also perform the matrix multiplication with the simple routine from the CUDA SDK examples. I also suggest double-checking compiler switches such as flush-to-zero mode, compute capability, etc.
Sorry, I did not read the end of your post. Still, trying the simple matrix multiplication routine would be interesting.
Did you use the right compute capability for doubles, by the way?
Also, what happens if you use floats in the CPU version?
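
As an example of the kind of simple reference routine suggested above, a naive complex matrix-vector product (one thread per output row) can be used to cross-check the cublasCgemm result. This is a sketch; it assumes an M x N column-major kernel matrix, and the names K, x, y are placeholders:

[codebox]#include <cuComplex.h>

/* Naive reference: y = K * x for an M x N column-major complex matrix,
   one thread per output row. */
__global__ void naive_cgemv(const cuFloatComplex *K, const cuFloatComplex *x,
                            cuFloatComplex *y, int M, int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) {
        cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
        for (int col = 0; col < N; ++col)
            acc = cuCaddf(acc, cuCmulf(K[col * M + row], x[col]));
        y[row] = acc;
    }
}[/codebox]

Because each row is accumulated in a fixed order, this also gives a deterministic baseline to compare against runs with different block sizes.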

I will try running the code in device emulation mode (although I’m currently using CUDA 3.1).
In any case, for the single-precision version of the code, __fmul_rn helped the accuracy a lot.

I have carefully checked the code outputs, instruction by instruction, and noticed that the evaluation of cos and sin compromises the accuracy considerably, as do reciprocals, sqrt and asin. These seem to be more significant sources of inaccuracy than the matrix multiplication.

On the Tesla card, I’m compiling like this:
system(sprintf('nvcc -I/matlab/extern/include -I/home/work/NVIDIA_GPU_Computing_SDK/C/common/inc -arch sm_20 --cuda "Campo_scalato_CUDA_v6.cu" -ftz=false -prec-div=false -prec-sqrt=false --output-file "Campo_scalato_CUDA_v6.cpp"', matlabroot));
mex -I/matlab/extern/include -L/usr/local/cuda/lib64 Campo_scalato_CUDA_v6.cpp -lcudart -lcublas

On the GTX 260 card, I used -arch sm_13 instead of -arch sm_20.

Concerning the options -ftz=false -prec-div=false -prec-sqrt=false, I have tried every combination of "true" and "false", but the result is always the same…
