Double precision accuracy with sqrt, log math functions: results on CPU & GPU are not exactly the same
Hi,

The results generated by the original C program run on CPU & CUDA version of the program run on Tesla C 2050 GPU are slightly different.

[code]
GPU                CPU
1.942624997515     1.942624907643
18.335377012932    18.335376984510
[/code]

The output is printed using the %15.12f format specifier (all values are of double datatype, in the e-7 to e-10 range). The program makes use of the math functions sqrt, pow, log, fabs, acos, etc. Initially I had used the "sqrtf" function, but there was a large difference between the CPU and GPU outputs, so I reverted to "sqrt".

Can someone tell me what could be the reason for the inaccuracy of these results?

Thanks in advance

#1
Posted 04/11/2012 07:02 AM   
Hi,
If sqrt and sqrtf don't make any difference for you, then you're probably computing in single precision. For compute capabilities < 1.3, doubles are silently demoted to float by the compiler. Since you target a C2050, have you tried compiling your code with "-arch=sm_20"?

#2
Posted 04/11/2012 07:34 AM   
Thanks for quick response.

"sqrt" vs. "sqrtf" (i.e., any math function with the "f" suffix) makes a big difference. When "sqrtf" was used, the difference was huge; that's why I reverted to "sqrt".
And, I'm using "-arch=sm_20" during compilation, as my __device__ & __global__ functions involve malloc.

#3
Posted 04/11/2012 07:58 AM   
Oops, sorry, I misread your post.
What might happen is that your CPU code, your GPU code, or both don't comply with the IEEE standard for floating-point precision.
Have you tried to enforce this compliance with the corresponding compiler options? AFAIK, nvcc is rather IEEE compliant by default, whereas CPU compilers are less so, especially when you compile with a -O3 type option. On x86 in particular, the internal x87 registers (if I'm not mistaken) store and compute in 80-bit arithmetic rather than 64-bit. Depending on your compiler of choice, you can enforce strict IEEE compliance to check whether your results are actually different between the CPU and GPU codes.

#4
Posted 04/11/2012 08:12 AM   
Another possibility is that this is accumulated error as a result of many operations. The CUDA implementations of the double precision transcendental functions are only guaranteed to match the correctly rounded result up to 1 or 2 units in the last place. Your CPU compiler might guarantee more or less precision than this for transcendentals. These differences can then magnify in subsequent code, depending on the math operations you apply.

Edit: Accuracy of double precision functions in CUDA is documented in Appendix C.1.2 in the CUDA Programming Guide.

#5
Posted 04/11/2012 11:43 AM   
Thanks for the detailed description.

Recompiled the code on the CPU using options taken from "gcc -v --help":

-mhard-float
-mieee-fp
-msoft-float
-fdefault-double-8
etc...

But none of these made any change to the results.

#6
Posted 04/11/2012 11:57 AM   
You cannot expect to achieve bit-for-bit agreement whatever the compiler options are (and btw., on x86 the compiler most likely uses the intrinsic functions so compiler options won't influence the results at all).

[url="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html"]What Every Computer Scientist Should Know About Floating-Point Arithmetic (reprint)[/url].

#7
Posted 04/11/2012 12:11 PM   
Check out the following NVIDIA whitepaper that explains many of the reasons for numerical results deviating between host and device computation:

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

One of the reasons pointed out in the whitepaper is the merging of multiplies and adds into FMAD or FMA instructions in device code. Since CUDA 4.1 this merging can be controlled at compilation-unit granularity by the -fmad={true|false} flag of nvcc. By specifying -fmad=false you may achieve results that match the CPU results more closely, at some (potentially significant) loss of performance.

#8
Posted 04/11/2012 05:12 PM   
Thanks njuffa. I'll try to install 4.1 and check for more accurate results. At present, the toolkit version is 4.0

#9
Posted 04/12/2012 09:58 AM   
To avoid misunderstandings: The lack of bit-by-bit agreement of GPU results with results from the host does not indicate a lack of accuracy on the part of the GPU computation. In fact the use of FMA (fused multiply-add) on the GPU frequently results in improved accuracy. One way to tell which results are more accurate is to compare to a higher-precision reference implementation, a technique I use frequently in my own work.

I highly recommend reading the whitepaper I pointed to; it discusses this and other issues in a lot more detail than I can provide here.

#10
Posted 04/12/2012 05:23 PM   