Fermi and Kepler GPU Special Function Units

The Fermi GPUs have Special Function Units (SFUs) to (quoting the NVIDIA White Paper on Fermi) "execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock".

My questions are:

  1. Do SFUs operate on single and double precision numbers or on single precision only?
  2. Do SFUs introduce any loss of accuracy in the computations?
  3. Are SFUs related to the use of intrinsics like __sinf(), __cosf(), etc.?
  4. Are the functionalities of the Kepler SFUs the same as for the Fermi SFUs?

Thank you very much in advance for any answer.

  1. The SFUs work on single precision numbers only.

  2. Yes, see #3.

  3. The SFU instructions are the implementation of the intrinsic functions like __sinf(), __cosf(), etc. Those functions have limited precision, as detailed in Table 7 of the CUDA Programming Guide. When you call cos(), you do not use the SFU; instead, the compiler emits a sequence of fused multiply-add (FMA) instructions that implement a more precise approximation of the transcendental function.

If you pass -use_fast_math to nvcc, it will automatically use the intrinsic versions of the transcendentals; otherwise you have to call them explicitly.

  4. This I don’t know. I haven’t seen any indication in the documentation, but I’m not sure.

Re (3): I think a better way of looking at this is that the device intrinsics __log2f(), __sinf(), __cosf() expose the instructions implemented by the special function unit :-) The HW implementation is based on quadratic interpolation in ROM tables using fixed-point arithmetic, as described in the following paper:

Stuart F. Oberman and Michael Siu. A high-performance area-efficient multifunction interpolator. In Proceedings of the 17th IEEE Symposium on Computer Arithmetic (Cape Cod, USA), pages 272–279, July 2005.

Re (4): I am not aware of any functional differences between the Fermi and Kepler special function units. Side remark: The special-function instructions actually show up in disassembled SASS code as MUFU.{LG2|EX2|SIN|COS|RCP|RSQ} for sm_20 and up. I assume MUFU stands for “multi-function unit”. The relative throughput of these instructions was improved on Kepler compared to Fermi.

And what about the square-root function? Is it performed by the SFU for both the intrinsic and non-intrinsic versions? Or is the non-intrinsic one ‘done elsewhere’?

MK

If you look at the PTX, there is sqrt.approx.f32 and sqrt.rn.f32. The former is an approximate single-precision square root implemented via MUFU.RSQ and MUFU.RCP, while the latter is a single-precision square root with IEEE-754 rounding to nearest-or-even which maps to a sequence of quite a few instructions, one of which is MUFU.RSQ. By disassembling code that contains a call to sqrtf() with cuobjdump --dump-sass, you can easily check this yourself.

On sm_1x, sqrtf() always maps to sqrt.approx.f32. On newer platforms sqrtf() maps to sqrt.rn.f32 by default, but maps to sqrt.approx.f32 if -prec-sqrt=false or -use_fast_math is passed on the nvcc command line. To get an IEEE-754 rounded single-precision square root on sm_1x, one has to use the intrinsic __fsqrt_rn(), which maps to fairly slow emulation code.

Can you clarify which intrinsics operate in one clock on the SFU? For simplicity, answer for sm_2x and above.

__fsqrt_rd()
__fsqrt_rn()
__fsqrt_ru()
__fsqrt_rz()

Also, how many clocks does __powf() take?

Lastly (and now I know I am wrong), I had thought that all single precision intrinsics listed here were one cycle.

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__MATH__INTRINSIC__SINGLE.html (I realize this is 4.2, but it is where google takes me – I care about 5.0).

How can I know which intrinsics operate in one clock?

The fact that a function is provided as an intrinsic (with leading double underscore, only available in device code) does not imply anything in particular about performance. The performance of single-precision intrinsics can also vary with compilation mode, in particular -ftz={true|false}. I would suggest measuring the throughput of those functions you care about, on a relevant GPU with relevant compiler switches. I have not had the need to perform such measurements for any app optimization work.

Thank you very much to njuffa for the answers and the very interesting suggested paper, which helped me get a better picture of how SFUs work. As far as I understand, the hardware calculating the intrinsic functions basically implements an algorithm that approximates those functions by a quadratic polynomial. The coefficients of such a polynomial are determined by a minimax optimization, which I think amounts to approximating the function by a second-order Chebyshev polynomial (which is the solution of a minimax problem).

I have another couple of questions:

  1. Could you recommend any reference describing how the single/double precision (non-intrinsic) transcendental functions are calculated by CUDA?
  2. In many engineering applications, a commonly used function is the sinc function (sin(x)/x), which is the composition of sin(x) and the reciprocal of x. Of course, there are also other functions representative of filters, etc. (e.g., sinh(x)/x, Hamming, Hanning). It would be interesting to have a fast way, implementable by developers, to directly calculate such function compositions rather than calculating each component separately. I guess the calculation strategy should adapt to the function’s characteristics. Could you recommend any reference or book giving guidelines on this topic?

Thank you again.

Lastly, and concerning rjl’s comment, from the white paper “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, by Peter N. Glaskowsky (http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA’s_Fermi-The_First_Complete_GPU_Architecture.pdf), page 21: “A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs” on the Fermi architecture; see also Fig. 7. Perhaps this will add a piece of information to answer your question…

Please note that CUDA intrinsics rarely map to a single SFU/MUFU instruction, but usually map to sequences of multiple SFU and non-SFU instructions. Different GPUs have different throughputs for the various operations involved, so if one needs to know the throughput of a particular intrinsic on a particular GPU, it would be best to simply measure it.

The core approximations used for the transcendental functions in the CUDA math library are pretty much all straightforward polynomial minimax approximations. These approximations were generated with the Remez algorithm, ready-to-use versions of which are provided by software like Mathematica and Maple. The argument reductions usually follow standard approaches; the references for any special techniques used are noted in comments inside the header files math_functions.h (single precision) and math_functions_dbl_ptx3.h (double precision) that are part of the CUDA distribution.

As a general starting point for floating-point computations, I usually recommend Muller et al., “Handbook of Floating-Point Arithmetic”: [url]http://perso.ens-lyon.fr/jean-michel.muller/Handbook.html[/url]

Regarding books useful for the development of one’s own transcendental function implementations, I gave a short overview in the following thread on Stack Overflow:
[url]http://stackoverflow.com/questions/99620/books-on-the-algorithims-needed-for-calculating-trancendental-functions/7464239#7464239[/url]

Are the SIMD-in-a-word Video Instructions performed by the SFU? By the FP64 cores? Maybe even the LD/ST unit? Or some other undocumented unit?

Apparently they are not executed by the main FP32 cores, since their use does not impact integer addition throughput. This was discussed at the great GTC talk on accelerating Smith Waterman matching.

I’m curious because I keep experimenting to optimize integer throughput, and more understanding of the hardware is always helpful!