I'm doing some performance testing to evaluate different functions in CUDA, and I have come upon the functions for calculating the square root: the standard 'sqrtf' and the intrinsic '__fsqrt_rn'.

The second is approximately three times slower. Is the only difference numerical accuracy, or am I reading the CUDA C Programming Guide wrong?

I ran the tests on a GTX 480 using CUDA Toolkit 4.0.
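For reference, a minimal pair of kernels of the kind being compared might look like this (a sketch; names and launch configuration are illustrative, not the actual test code):

```cuda
// Illustrative sketch: two kernels differing only in which square root is used.
// Each thread reads one float, takes its square root, and writes the result.
__global__ void sqrt_libm(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(in[i]);        // library function; see answer below for
                                      // how the compiler maps this
}

__global__ void sqrt_intrinsic(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __fsqrt_rn(in[i]);   // intrinsic: IEEE round-to-nearest sqrt
}
```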


sqrtf() is a single-precision square root function that can map either to an approximate square-root implementation or to one that rounds to nearest-or-even according to the IEEE-754 standard.

On sm_1x devices, sqrtf() always maps to the approximate square-root implementation. On sm_2x and sm_3x devices the mapping is controlled by the compiler flag -prec-sqrt={true|false}, whose default is "true". With -prec-sqrt=false, sqrtf() maps to the approximate implementation; with -prec-sqrt=true it maps to the IEEE-rounded one. -use_fast_math implies -prec-sqrt=false.
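The effect of the flag can be summarized in comment form (file name is hypothetical; the mapping of sqrtf() is per the description above, which is easy to confirm by inspecting the generated code with cuobjdump):

```cuda
// How sqrtf() is compiled on sm_2x/sm_3x, depending on nvcc flags (sketch):
//
//   nvcc -arch=sm_20                   kernel.cu   // -prec-sqrt=true is the
//                                                  // default: IEEE-rounded sqrtf()
//   nvcc -arch=sm_20 -prec-sqrt=false  kernel.cu   // approximate sqrtf()
//   nvcc -arch=sm_20 -use_fast_math    kernel.cu   // implies -prec-sqrt=false
//
__global__ void k(const float *in, float *out)
{
    *out = sqrtf(*in);   // which implementation this becomes depends on the flags
}
```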

__fsqrt_rn() always maps to an implementation that rounds to nearest-or-even according to the IEEE-754 standard. It is quite slow on sm_1x devices, since that hardware does not support the single-precision FMA (fused multiply-add) operation, which is crucial to high-performance implementations of correctly rounded square root.
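To see why FMA matters, here is a simplified sketch of the kind of refinement step such implementations use; this is NOT the actual CUDA math-library code, just an illustration of the technique. Starting from the hardware's approximate reciprocal square root, one Newton step computes the residual x - y*y exactly enough only because fmaf() performs the multiply and add with a single rounding:

```cuda
// Simplified sketch (not the real library code): refining an approximate
// square root with fused multiply-adds.
__device__ float sqrt_refine(float x)
{
    float r = rsqrtf(x);        // hardware approximation of 1/sqrt(x)
    float y = x * r;            // y ~= sqrt(x)
    float h = 0.5f * r;
    float e = fmaf(-y, y, x);   // residual x - y*y; without FMA, the
                                // intermediate y*y would be rounded and the
                                // small residual lost to cancellation
    return fmaf(e, h, y);       // corrected estimate y + e*h
}
```

Without hardware FMA, each such step must be emulated with multiple operations to preserve the low-order bits, which is why the correctly rounded path is so much slower on sm_1x.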

Even on sm_2x and sm_3x devices, significant performance differences between the approximate and IEEE-rounded versions can be observed; this is simply a consequence of the extra work necessary to guarantee a standard-compliant result. Over successive CUDA releases, a lot of work has gone into providing optimized implementations of such correctly rounded mathematical primitives.






Thank you for your reply. That clarified things!

Cheers

Henrik Andresen
