I’m currently working on a small, simple CUDA interval library (for a bachelor thesis) and need some help with the rounding modes.
The Programming Guide states:
and later on in the appendix:
My question is, how can I “statically” set the rounding mode to “round-towards-zero”?
I know there are C intrinsics for all four rounding modes, but being intrinsics/functions they are very slow.
And a little bit off-topic:
I tried to multiply large float numbers with __fmul_rz,
e.g. __fmul_rz(2x10^32, 2x10^32)
I expected the result to be +infinity, but the result is the number just “below infinity”,
i.e. the largest finite float: 3.4028235x10^38
It seems as if __fmul_rz “rounds down” infinity. I’m not sure whether this is my fault, because I am using the 3.0 SDK Debug Emulator (which is deprecated).
The statement cited above applies to single-precision addition and multiplication on sm_1x hardware. For sm_2x platforms, single-precision addition, multiplication, and fused multiply-add with all four IEEE rounding modes are supported directly in hardware. To achieve a uniform interface at the CUDA C level, __fadd_ru(), __fadd_rd(), __fmul_ru(), and __fmul_rd() are emulated in software on sm_1x platforms and are therefore slow there.
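To illustrate how these intrinsics would typically be used in an interval library: as far as I know, CUDA has no global rounding-mode state to switch, so the rounding mode is chosen per operation through the intrinsic's suffix. Below is a minimal sketch of interval addition. The `interval` struct and the names `interval_add`, `add_kernel`, and `add_on_device` are hypothetical and made up for this example; they are not part of CUDA or of any existing library, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical interval type for illustration; not a CUDA library type.
struct interval {
    float lo;  // lower bound
    float hi;  // upper bound
};

// The lower bound is rounded toward -infinity and the upper bound toward
// +infinity, so the result encloses every exact sum x + y with x in a, y in b.
__device__ interval interval_add(interval a, interval b)
{
    interval r;
    r.lo = __fadd_rd(a.lo, b.lo);
    r.hi = __fadd_ru(a.hi, b.hi);
    return r;
}

// Single-thread kernel that adds one pair of intervals.
__global__ void add_kernel(interval a, interval b, interval *out)
{
    *out = interval_add(a, b);
}

// Host helper (hypothetical): runs the kernel once and copies the result back.
// CUDA error checking is left out to keep the sketch short.
interval add_on_device(interval a, interval b)
{
    interval *d_out = NULL;
    interval h_out = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(interval));
    add_kernel<<<1, 1>>>(a, b, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(interval), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

For example, adding the point intervals [1, 1] and [1e-10, 1e-10] with `add_on_device` should give a lower bound of exactly 1.0f and an upper bound that is the next float above 1.0f, because the exact sum is not representable in single precision.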
The following paper explains how to create an interval library for sm_1x platforms using only the round-to-zero and round-to-nearest rounding modes supported by hardware:
Sylvain Collange, Jorge Flórez, David Defour
A GPU interval library based on Boost.Interval
8th Conference on Real Numbers and Computers, Santiago de Compostela, Spain (2008)
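Getting the largest finite float from __fmul_rz here is the expected answer according to the IEEE-754 standard: with round-towards-zero, a result that overflows is rounded to the largest finite number rather than to infinity. Note that this is also consistent with interval arithmetic.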
If your multiplication example is computed using interval arithmetic (single-precision), the resulting interval will be [3.4028235x10^38, +infinity]. It contains the exact result 4x10^64.
Returning [+infinity, +infinity] in this case would break the containment property (assuming we manage to properly define what [+inf,+inf] means…)
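The overflow behaviour described above can be checked with a small test program. The following is only a sketch: the names `overflow_kernel` and `overflow_on_device` are hypothetical, error checking is omitted, and it assumes a device and toolkit on which the directed-rounding intrinsics are available.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies x by y with three different rounding modes.
// The operands are passed as arguments so that the products are computed
// on the device at run time.
__global__ void overflow_kernel(float x, float y, float *out)
{
    out[0] = __fmul_rz(x, y);  // round toward zero
    out[1] = __fmul_rd(x, y);  // round toward -infinity (interval lower bound)
    out[2] = __fmul_ru(x, y);  // round toward +infinity (interval upper bound)
}

// Host helper (hypothetical): runs the kernel once and copies the three
// products back. CUDA error checking is left out to keep the sketch short.
void overflow_on_device(float x, float y, float out[3])
{
    float *d_out = NULL;
    cudaMalloc(&d_out, 3 * sizeof(float));
    overflow_kernel<<<1, 1>>>(x, y, d_out);
    cudaMemcpy(out, d_out, 3 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

Called with x = y = 2e32f, IEEE-754 semantics imply that the first two entries equal FLT_MAX (3.4028235x10^38) and the third is +infinity, which matches the interval [3.4028235x10^38, +infinity] given above.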
If you are only targeting sm_20 platforms, you can also have a look at the Interval sample in the CUDA SDK 3.2.