Double precision accuracy with sqrt, log math functions: results on CPU & GPU are not exactly the same
Hi,

The results generated by the original C program run on CPU & CUDA version of the program run on Tesla C 2050 GPU are slightly different.

[code]
GPU                CPU
1.942624997515     1.942624907643
18.335377012932    18.335376984510
[/code]

The output is printed using the %15.12f format specifier (all values are of double datatype, in the e-7 to e-10 range). The program makes use of the math functions sqrt, pow, log, fabs, acos, etc. Initially I had used the "sqrtf" function, but there was a large difference between the CPU and GPU outputs, so I reverted to "sqrt".

Can someone tell me what could be the reason for the inaccuracy of these results?

Thanks in advance

#1
Posted 04/11/2012 07:02 AM   
Hi,
If sqrt and sqrtf don't make any difference for you, then you're probably computing in single precision. For compute capabilities < 1.3, doubles are silently demoted to float by the compiler. Since you target a C2050, have you tried compiling your code with "-arch=sm_20"?

#2
Posted 04/11/2012 07:34 AM   
Thanks for quick response.

"sqrt" vs. "sqrtf" (i.e., any math function with the "f" suffix) makes a big difference. When "sqrtf" was used, the difference was huge; that's why I reverted to "sqrt".
And, I'm using "-arch=sm_20" during compilation, as my __device__ & __global__ functions involve malloc.

#3
Posted 04/11/2012 07:58 AM   
Oops, sorry, I misread your post.
What might happen is that your CPU code, your GPU code, or both don't comply with the IEEE standard for floating-point precision.
Have you tried to enforce this compliance with the corresponding compiler options? AFAIK, nvcc is rather IEEE compliant by default, whereas CPU compilers are less so, especially when you compile with a -O3 type option. On x86 in particular, the internal x87 registers (if I'm not mistaken) store and compute in 80-bit arithmetic rather than 64-bit. Depending on your compiler of choice, you can enforce strict IEEE compliance to check whether your results are actually different between the CPU and GPU codes.

#4
Posted 04/11/2012 08:12 AM   
Another possibility is that this is accumulated error as a result of many operations. The CUDA implementations of the double precision transcendental functions are only guaranteed to match the correctly rounded result up to 1 or 2 units in the last place. Your CPU compiler might guarantee more or less precision than this for transcendentals. These differences can then magnify in subsequent code, depending on the math operations you apply.

Edit: Accuracy of double precision functions in CUDA is documented in Appendix C.1.2 in the CUDA Programming Guide.

#5
Posted 04/11/2012 11:43 AM   
Thanks for the detailed description.

Recompiled the code on the CPU using options taken from "gcc -v --help":

-mhard-float
-mieee-fp
-msoft-float
-fdefault-double-8
etc...

But none of these made any change to the results.

#6
Posted 04/11/2012 11:57 AM   
You cannot expect to achieve bit-for-bit agreement whatever the compiler options are (and btw., on x86 the compiler most likely uses the intrinsic functions so compiler options won't influence the results at all).

[url="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html"]What Every Computer Scientist Should Know About Floating-Point Arithmetic (reprint)[/url].

#7
Posted 04/11/2012 12:11 PM   
Check out the following NVIDIA whitepaper that explains many of the reasons for numerical results deviating between host and device computation:

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

One of the reasons pointed out in the whitepaper is the merging of multiplies and adds into FMAD or FMA instructions in device code. Since CUDA 4.1 this merging can be controlled at compilation-unit granularity by the -fmad={true|false} flag of nvcc. By specifying -fmad=false you may achieve results that match the CPU results more closely, at some (potentially significant) loss of performance.

#8
Posted 04/11/2012 05:12 PM   
Thanks njuffa. I'll try to install 4.1 and check for more accurate results. At present, the toolkit version is 4.0

#9
Posted 04/12/2012 09:58 AM   
To avoid misunderstandings: The lack of bit-by-bit agreement of GPU results with results from the host does not indicate a lack of accuracy on the part of the GPU computation. In fact the use of FMA (fused multiply-add) on the GPU frequently results in improved accuracy. One way to tell which results are more accurate is to compare to a higher-precision reference implementation, a technique I use frequently in my own work.

I highly recommend reading the whitepaper I pointed to; it discusses this and other issues in a lot more detail than I can provide here.

#10
Posted 04/12/2012 05:23 PM   