Hello,
I ran into a strange issue while testing my HGEMM implementation: the tests pass against cuBLAS but fail against the half-precision library from half.sourceforge.net. After hours of debugging I managed to reduce it to a minimal example (apologies for the ugly-looking floats below; the behavior is very rare and seems to occur in less than 0.01% of random computations).
As far as I can tell, the following PTX code exhibits unexpected rounding behavior:
.version 5.0
.target sm_60
.address_size 64
.entry debug(.param .u64 _dbg){
.reg .u64 %dbg;
.reg .b16 %hacc, %ha, %hb;
// These conversions are exact: all three constants are representable in fp16
cvt.rn.f16.f32 %hacc, 0.40576171875;
cvt.rn.f16.f32 %ha, 0.1300048828125;
cvt.rn.f16.f32 %hb, 0.2469482421875;
fma.rn.f16 %hacc, %ha, %hb, %hacc;
ld.param.u64 %dbg, [_dbg];
st.global.b16 [%dbg], %hacc;
}
*dbg = 0.437744140625
This seems incorrect.
Evaluating the FMA with an fp32 multiplication and addition, which is how I read the PTX documentation, gives:
0.4378662109375,
which lies exactly halfway between two consecutive half-precision values:
0.437744140625 = 0.25 * 1.7509765625 – the fractional part is 1100000001 (odd)
0.43798828125 = 0.25 * 1.751953125 – the fractional part is 1100000010 (even)
It looks as though, in this particular case, the fma instruction gets confused and rounds to nearest-odd instead of nearest-even as specified by the PTX documentation.
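A quick way to double-check the tie analysis above is to redo the arithmetic exactly on the host. The sketch below is Python (standard library only) rather than PTX. The helper names `round_to_f16` and `round_to_f32` are my own, and it assumes that `struct`'s `e` and `f` formats round to nearest-even, as CPython does. It compares a single rounding of the exact a*b+c with a two-step evaluation that rounds the sum to fp32 first.

```python
import struct
from fractions import Fraction


def round_to_f16(x: float) -> float:
    """Round a binary64 value to the nearest binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# The three operands from the PTX listing; all are exactly representable in fp16.
a = 0.1300048828125
b = 0.2469482421875
c = 0.40576171875

# Exact value of a*b + c as a rational number (no rounding at all).
exact = Fraction(a) * Fraction(b) + Fraction(c)

# The two neighbouring half-precision values and their midpoint.
lower = Fraction(0.437744140625)
upper = Fraction(0.43798828125)
midpoint = (lower + upper) / 2

# The exact sum needs only 25 significant bits, so float(exact) is lossless.
exact_as_double = float(exact)

# One rounding: exact result straight to fp16.
fused = round_to_f16(exact_as_double)

# Two roundings: exact result to fp32 first, then fp32 to fp16.
sum_f32 = round_to_f32(exact_as_double)
two_step = round_to_f16(sum_f32)

print("exact - midpoint :", exact - midpoint)  # -1/67108864, i.e. -2**-26
print("fp32 sum         :", sum_f32)           # 0.4378662109375
print("single rounding  :", fused)             # 0.437744140625
print("fp32 then fp16   :", two_step)          # 0.43798828125
```

With these inputs the exact sum comes out 2^-26 below the midpoint of the two half-floats, and only the fp32-rounded sum lands exactly on it. So an FMA that rounds once from the exact result and an fp32-then-fp16 evaluation would be expected to disagree here. The sketch cannot tell which of the two the hardware instruction actually implements.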
As a side note, the normal cvt instruction seems to get it right:
.version 5.0
.target sm_60
.address_size 64
.entry debug(.param .u64 _dbg){
.reg .u64 %dbg;
.reg .b16 %hacc;
cvt.rn.f16.f32 %hacc, 0.4378662109375;
ld.param.u64 %dbg, [_dbg];
st.global.b16 [%dbg], %hacc;
}
*dbg = 0.43798828125
This does look like a bug to me, but I may be missing something about the behavior of HFMA.
Thanks!