Unit interval 16-bit (1.15) float hack

I haven’t used this code yet but thought it was kind of an interesting hack… I’ll leave it here.

I’m guessing others have come up with similar hacks but I was too lazy to look. :)

Problem:
I’d like to somehow pack two 32-bit floats into a 32-bit word.

I know the floats are saturated to the range [0.0,1.0].

Losing lots of precision over time is entirely OK.

Solution:
The trick is to recognize that the middle 16 bits of an IEEE-754 32-bit float (bits 8 through 23) are where the hack should focus.

Notice that the bit patterns between 1.0f and 1.9999999f vary only in the mantissa:

1.0000000 = 0x3f800000
1.9999999 = 0x3fffffff
2.0000000 = 0x40000000

Then notice that:

2.0 - 1.0 = 1.0 = 0x3f800000

In other words, the bit pattern of 1.0f doubles as the integer constant you’ll subtract.

A hack presents itself…

If you shift [0.0,1.0] to [1.0,2.0] (by adding 1.0), reinterpret the result as a 32-bit unsigned int, subtract 0x3f800000, and select the middle 16 bits of the word, you end up with the top 15 bits of the mantissa plus a leading bit that is 1 only if the original value was 1.0.
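
For example, packing v = 0.5:

0.5 + 1.0 = 1.5 = 0x3fc00000
0x3fc00000 - 0x3f800000 = 0x00400000
middle 16 bits = 0x4000 (0.5 in 1.15 fixed point)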

You can then select these middle 16 bits and place them in their destination:

// Pack a saturated float into the low 16 bits of *sat16v2, keeping the
// high 16 bits intact (PXL_SAT16_PACK_MANTISSA_BIAS shifts [0.0,1.0]
// to [1.0,2.0], i.e. 1.0f)
void pxl_sat16v2_pack_lo(uint32_t* const sat16v2, const float v)
{
  // shift to [1.0,2.0], reinterpret as bits, subtract bits(1.0f)
  const uint32_t t = __float_as_uint(v + PXL_SAT16_PACK_MANTISSA_BIAS) - 0x3f800000;
  // selector 0x3265: keep *sat16v2's bytes 3,2 as the high half and
  // take t's bytes 2,1 (bits 8..23) as the low half
  asm("prmt.b32 %0, %1, %2, 0x3265;" : "=r"(*sat16v2) : "r"(*sat16v2), "r"(t));
}

Unpacking is similar… select a 16-bit unsigned short from the 32-bit word, shift it back up into the middle bits, add 0x3f800000, reinterpret it as a float, subtract 1.0 and return the result.

// Unpack the low 16 bits of sat16v2 back into a [0.0,1.0] float
float pxl_sat16v2_unpack_lo(const uint32_t sat16v2)
{
  uint32_t d;
  // selector 0x7104: moves the low 16 bits of sat16v2 up into
  // bits 8..23 of d, zeroing the rest
  asm("prmt.b32 %0, %1, 0x0, 0x7104;" : "=r"(d) : "r"(sat16v2));
  // add bits(1.0f), reinterpret as a float, undo the +1.0 shift
  return __uint_as_float(d + 0x3f800000) - 1.0f;
}
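
A quick round trip (assuming PXL_SAT16_PACK_MANTISSA_BIAS is the 1.0f shift):

uint32_t packed = 0;
pxl_sat16v2_pack_lo(&packed, 0.5f);             // low half becomes 0x4000
const float v = pxl_sat16v2_unpack_lo(packed);  // v == 0.5f exactly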

Is this a good hack? Is it faster than an fp32<->fp16 conversion?

I have no idea… I never used it or dwelled on its limitations, but it might be useful to someone. :)

I dumped some source code here.

Consider adding 0x80 before subtracting 0x3f800000 to implement a ‘float’ to fixed-point conversion with rounding instead of truncation, for a bit of added accuracy. This does not incur additional runtime cost as the compiler will merge the two constants at compile time. Since the most significant byte is tossed out anyhow, subtracting 0x00800000 would suffice, but I see no advantage in that (the machine instruction can incorporate a full 32-bit immediate value).
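
A minimal sketch of that rounding variant (the function name is mine; the bias is written out as 1.0f here so the constant folding is visible):

void pxl_sat16v2_pack_lo_rn(uint32_t* const sat16v2, const float v)
{
  // +0x80 rounds bit 8 before the middle 16 bits are selected; the
  // compiler folds +0x80 - 0x3f800000 into one 32-bit immediate
  const uint32_t t = __float_as_uint(v + 1.0f) + 0x80 - 0x3f800000;
  asm("prmt.b32 %0, %1, %2, 0x3265;" : "=r"(*sat16v2) : "r"(*sat16v2), "r"(t));
}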

Traditionally, the fastest way to read fp16 data was through a texture, as this automatically expands the data into a 32-bit ‘float’ during reads, without use of additional conversion instructions. Whether this still holds on Maxwell, I do not know, but I suspect it does.

One advantage of the proposed method is certainly that it also applies to writable data and doesn’t require the use of textures.

This is similar to the method I use in my Monte Carlo PRNG code to transform a random word of bits into a float or double in a 0…1 or -1…1 range.

In floating point, say we have 32 random bits in u32, and we want a random number from 0…1. There are two basic strategies to choose from for this transformation.

We could just form u32/0xFFFFFFFF. This is easy and robust, and the most common transformation. It costs an int32->fp cvt instruction then an FP multiply, both of which are not too bad. It has a nice advantage in that we use extra entropy bits to get even more random sample density for small magnitude values of f in a fractal-like way… there are 2^23 discrete output possibilities in (0.5,1.0), but there’s also the same sample count (with lower weight) even down at (0.5,1.0)*2^-9. This is really useful if you’re transforming the random number into another distribution using a mapping which is sensitive to the variate’s quantization at the tail(s) (such as taking a sample from a Poisson distribution with high lambda). But this output distribution asymmetry is also a flaw in that it’s no longer exactly uniform. One consequence is that the mean value is not exactly 0.5 as you’d expect.
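
A sketch of that first strategy (the helper name is mine):

float u32_to_unit(const uint32_t u)
{
  // int->fp cvt, then one FP multiply by 1/0xFFFFFFFF
  return (float)u * (1.0f / 4294967295.0f);
}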

The other strategy is the one I use, which is to treat the input bits as a fixed point number and convert that to a float. By setting the low bit of the fixed point value, the output f can be crafted to be symmetric around 0.5 and to avoid both 0.0 and 1.0. This means the output random variate f has robust statistical attributes: its mean is exactly 0.5 (unlike the first strategy), it is evenly sampled everywhere, and it is symmetric, so (1.0-f) has the exact same distribution as f itself. This is the behavior that works best for my single precision Monte Carlo work. (For double precision, the large number of bits makes the statistical quality difference almost unmeasurable, in which case the fixed point method wins because it’s faster.)

So getting to the transformation, my fixed point to floating point conversion uses Allan’s style of surgical manipulation of the IEEE bit representation.
I use 23 bits to form the fixed point value, force the low bit to 1 (to make it symmetric around 0.5 and exclude 0.0), and insert them into the mantissa of an IEEE floating point value between 1.0 and 2.0. This trick is exactly the one Allan mentioned… in that range the exponent is unchanging and only the mantissa bits vary. Finally I subtract 1.0f from the value to get my uniform 0…1 variate.
Outputting a random variate from -1 to 1 is similar… you just use a larger fp exponent to assemble an initial value from 2.0 to 4.0 and subtract 3.0. This final subtraction is usually free: after inlining, the variate is typically remapped by the user anyway, and the compiler folds both transformations into a single FMA instruction.
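
A sketch of both variants, as I read the description (names are mine; __uint_as_float does the bit reinterpretation on the GPU side):

// 0…1 variate: 23 random bits into the mantissa of a value in [1.0,2.0)
float rng_unit(const uint32_t u)
{
  // force the low bit to 1: symmetric around 0.5, excludes 0.0 and 1.0
  const uint32_t m = (u & 0x007fffff) | 1u;
  return __uint_as_float(0x3f800000 | m) - 1.0f;
}

// -1…1 variate: same bits under the [2.0,4.0) exponent, minus the midpoint
float rng_signed(const uint32_t u)
{
  const uint32_t m = (u & 0x007fffff) | 1u;
  return __uint_as_float(0x40000000 | m) - 3.0f;
}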

The “surgical IEEE bits” assembly is faster than a native int->FP conversion on both CPU and GPU. In my code (maintained for both CPU and GPU) I use two masks to assemble the output, but I see now I should switch to Allan’s prmt.b32 method on the GPU.

It was @spworley who gave me the original 1.0-2.0 hint back in February!

Semi-related historical fact: apparently Adobe Photoshop uses/used a quasi 15-bit representation for deep color.