How FP32 and FP16 units are implemented in the GP100 GPU

The GP100 GPU, based on the Pascal architecture, delivers 10.6 TFLOPS of FP32 performance and 21.2 TFLOPS of FP16 performance. The representations of FP16 and FP32 numbers are quite different, i.e., the same number has a different bit pattern in FP32 than in FP16 (unlike integers, where a 16-bit integer has the same bit pattern in a 32-bit representation except for leading zeros).
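For example, the value 1.5 is encoded as 0x3FC00000 in FP32 but as 0x3E00 in FP16. A minimal sketch to verify this (not part of the original question; it assumes a recent CUDA toolkit in which the cuda_fp16.h conversion helpers can be called from host code):

// bits.cu: the same value has different bit patterns in FP32 and FP16
#include <cstdio>
#include <cstring>
#include <cuda_fp16.h>

int main() {
    float  f = 1.5f;                 // FP32: 1 sign, 8 exponent, 23 mantissa bits
    __half h = __float2half(f);      // FP16: 1 sign, 5 exponent, 10 mantissa bits

    unsigned int   fbits;            // reinterpret the raw storage, no conversion
    unsigned short hbits;
    std::memcpy(&fbits, &f, sizeof fbits);
    std::memcpy(&hbits, &h, sizeof hbits);

    std::printf("FP32 bits of 1.5: 0x%08X\n", fbits);   // 0x3FC00000
    std::printf("FP16 bits of 1.5: 0x%04X\n", hbits);   // 0x3E00
    return 0;
}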

How are the floating-point units in GP100 implemented such that nearly a 2x speedup is achieved by moving from FP32 to FP16?

They are represented as the corresponding IEEE-754 datatype indicates.

CUDA refers to this as the half datatype.

There is also a half2 vector type, which is needed to reach maximum performance for some operations.

Refer to the cuda_fp16.h header file.

https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/
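As an illustration (a minimal sketch, not taken from the linked post), here is a packed-FP16 kernel built on the __half2 type and the __hfma2 intrinsic from cuda_fp16.h; it assumes a GPU with fast FP16 support such as GP100 (compute capability 5.3 or higher):

#include <cuda_fp16.h>

// Each thread works on one __half2, i.e. a pair of FP16 values.
// __hfma2 performs a fused multiply-add on both halves in a single
// instruction, which is what allows FP16 throughput to reach twice
// the FP32 rate.
__global__ void saxpy_half2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        y[i] = __hfma2(a, x[i], y[i]);   // y = a * x + y, two elements at a time
    }
}
// launched e.g. as saxpy_half2<<<(n2 + 255) / 256, 256>>>(n2, a2, d_x, d_y);

A scalar __half version of the same kernel compiles as well, but the doubled throughput comes from operating on __half2 pairs, since each lane then retires two FP16 results per instruction.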

Thanks for the quick response.

My question was from an architecture perspective.

If GP100 supports half-precision units, then there is a piece of hardware that can “decode” the half-precision format and do the computation. Similarly, for single-precision, there should exist a piece of hardware that decodes IEEE-754 single precision format and performs some computation.

Are the half-precision computation units and single-precision computation units related in any way? More specifically, is one single precision computation unit composed of two half-precision computation units?

Can you kindly point me to any documentation related to this implementation?

I am not aware of NVIDIA documentation that explains the microarchitecture to that level. However, for recent generations of NVIDIA GPUs, the wide range of relative computational throughputs suggests that the FP16, FP32, and FP64 units are built as separate entities, which allows NVIDIA to compose processors with the throughput profile required for particular market segments. This is conjecture, as I stated.

It is possible to build shared units that re-use relatively expensive hardware such as multiplier arrays, and various such schemes are described in the literature.

If this design decision (shared vs separate hardware for different floating-point formats) were documented, how would you take advantage of it?

I was just curious about the implementation as the performance of FP16 is exactly twice that of FP32.

While having separate computational units for FP16, FP32, and FP64 is a possible option, it comes at the cost of additional silicon area.

The other option is to share the units, but that may not result in exact 2x scaling.
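Back-of-the-envelope, using the published Tesla P100 figures (3584 FP32 cores, ~1480 MHz boost clock), the exact factor of two is what one would expect if the same cores simply executed packed half2 operations:

peak FP32 = 3584 cores x 1.48 GHz x 2 FLOPs per FMA             ~= 10.6 TFLOPS
peak FP16 = 3584 cores x 1.48 GHz x 2 FLOPs per FMA x 2 (half2) ~= 21.2 TFLOPS

i.e., one half2 FMA per core per clock would give exactly 2x without any FP16-only units.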

The design philosophy NVIDIA seems to use is to build a base configuration with full FP32 performance (which is needed for 3D graphics, their bread & butter business), then bolt on additional units (FP16, FP64) for professional markets, where the additional revenue per part (thousands of dollars) more than makes up for higher die costs (hundreds of dollars).

This approach allows their consumer line to compete on price, while allowing their professional line to compete on performance and features.

Again, conjecture on my part.

As njuffa points out, the actual hardware implementation is pretty much unimportant for programming, since it's all abstracted away from us. And NVIDIA rarely gives much detail.

But, for your very specific question of whether GP100’s FP16 and FP32 ALUs are shared in the same hardware sub unit, the GP100 whitepaper (surprisingly) does actually answer that exact question: “One new capability that has been added to GP100’s FP32 CUDA Cores is the ability to process both 16-bit and 32-bit precision instructions and data.”

Thanks for the pointer. I stand corrected.

Thanks for the inputs.