Help understanding sqrt functions in CUDA
Hi All

I'm doing some performance testing to evaluate different functions in CUDA, and I have come to the functions that calculate the square root. There is both a standard 'sqrtf' and an intrinsic '__fsqrt_rn'.

The second is approximately three times slower. Is the only difference numerical accuracy? Or am I reading the CUDA C Programming Guide wrong?

I ran the tests on a GTX 480 using CUDA Toolkit 4.0.


Thank you

Henrik Andresen

#1
Posted 05/07/2012 12:55 PM   
sqrtf() is a single-precision square root function that can map either to an approximate square root implementation, or one that rounds to nearest or even according to the IEEE-754 standard.

On sm_1x devices, sqrtf() always maps to the approximate square root implementation. On sm_2x and sm_3x devices the mapping is controlled by the compiler flag -prec-sqrt={true|false}. The default setting is "true". When -prec-sqrt=false is specified, sqrtf() maps to the approximate square root implementation; with -prec-sqrt=true it maps to the IEEE-rounded one. -use_fast_math implies -prec-sqrt=false.
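To check which mapping a given build actually uses, something like the following minimal sketch could help. It computes both sqrtf() and __fsqrt_rn() for the same inputs and counts the results that differ bitwise. The kernel name sqrt_both, the helper count_mismatches, the file name in the comments, and the input values are made up for illustration. On sm_2x or later with the default -prec-sqrt=true the count should be zero; with -prec-sqrt=false or -use_fast_math some results may differ.

```cuda
// Build (pick the -arch that matches your device), for example:
//   nvcc -prec-sqrt=true  sqrt_cmp.cu -o sqrt_cmp
//   nvcc -prec-sqrt=false sqrt_cmp.cu -o sqrt_cmp
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: compute the square root of each input both ways.
__global__ void sqrt_both(const float *in, float *out_sqrtf, float *out_rn, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_sqrtf[i] = sqrtf(in[i]);       // mapping depends on -prec-sqrt
        out_rn[i]    = __fsqrt_rn(in[i]);  // always IEEE-754 round-to-nearest-even
    }
}

// Hypothetical helper: returns how many results differ bitwise,
// or -1 if the final copy back from the device reports a CUDA error.
int count_mismatches(int n)
{
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = 0.37f * (i + 1);         // arbitrary positive test values

    const size_t bytes = n * sizeof(float);
    float *d_in = 0, *d_a = 0, *d_b = 0;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, &h_in[0], bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    sqrt_both<<<(n + threads - 1) / threads, threads>>>(d_in, d_a, d_b, n);

    cudaMemcpy(&h_a[0], d_a, bytes, cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(&h_b[0], d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (memcmp(&h_a[i], &h_b[i], sizeof(float)) != 0)
            ++mismatches;
    return mismatches;
}

int main()
{
    int m = count_mismatches(1 << 20);
    printf("results differing between sqrtf() and __fsqrt_rn(): %d\n", m);
    return m < 0;
}
```

The comparison is bitwise rather than with a tolerance because the point is to see whether the two calls produce the same rounded result, not whether they are merely close.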

__fsqrt_rn() always maps to an implementation that rounds to nearest-or-even according to the IEEE-754 standard. It is quite slow on sm_1x devices since the hardware does not support the single-precision FMA (fused multiply-add) operation, which is crucial to high-performance implementations of correctly rounded square root.

Even on sm_2x and sm_3x devices significant performance differences between approximate and IEEE-rounded versions can be observed, which is simply a consequence of the work necessary to guarantee the standard-compliant result. Over successive generations of CUDA, a lot of work has gone into providing optimized implementations of such correctly rounded mathematical primitives.
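For measuring the difference yourself, a rough timing sketch along these lines may be a starting point. The kernel names chain_sqrtf and chain_fsqrt_rn, the helper run_benchmark, and the grid size and iteration count are all assumptions chosen for illustration. Each thread runs a dependent chain of square roots so that arithmetic rather than memory traffic dominates. Built with the default flags on sm_2x or later, both kernels should use the IEEE-rounded square root; building with -prec-sqrt=false switches only the sqrtf() kernel to the approximate version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical micro-benchmark kernels: a dependent chain of square roots per thread.
__global__ void chain_sqrtf(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = sqrtf(x) + 1.0f;               // mapping depends on -prec-sqrt
    data[i] = x;                           // store so the loop is not optimized away
}

__global__ void chain_fsqrt_rn(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = __fsqrt_rn(x) + 1.0f;          // always IEEE-754 rounded
    data[i] = x;
}

// Hypothetical helper: times one launch of the chosen kernel, in milliseconds.
float run_benchmark(bool use_rn)
{
    const int blocks = 1024, threads = 256, iters = 1000;
    const size_t bytes = (size_t)blocks * threads * sizeof(float);

    float *d_data = 0;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);          // all inputs start at 0.0f

    void (*kernel)(float *, int) = use_rn ? chain_fsqrt_rn : chain_sqrtf;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(d_data, iters);   // warm-up launch, not timed

    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(d_data, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

int main()
{
    printf("sqrtf()      : %.3f ms\n", run_benchmark(false));
    printf("__fsqrt_rn() : %.3f ms\n", run_benchmark(true));
    return 0;
}
```

The numbers depend on the device, toolkit version, and compiler flags, so they are best read as a relative comparison between two builds of the same program.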

#2
Posted 05/11/2012 11:07 AM   
Hi Njuffa

Thank you for your reply. That clarified things!

Cheers

Henrik Andresen

#3
Posted 05/11/2012 11:09 AM   