Curious handling of infinity by CUDA compiler

In trying to find a compact idiom for floating-point infinity for use in host and device code, I observed a curious transformation applied by the front half of the CUDA compiler. I am using the CUDA 6.5 tool chain and would be interested to know whether this still reproduces with the CUDA 7.5 RC, if anybody here is inclined to give it a try. I started out with code handling ‘float’ data:

__global__ void kernel (float *out, const float *in)
{
    *out = 1e38f * 1e38f;  // product overflows 'float' range -> +Inf
}

The resulting PTX code is as I expected it, storing the single-precision infinity encoding 0x7F800000 into the output location:

.visible .entry _Z6kernelPfPKf(
.param .u64 _Z6kernelPfPKf_param_0,
.param .u64 _Z6kernelPfPKf_param_1
)
{
.reg .s32 %r<2>;
.reg .s64 %rd<3>;

ld.param.u64 %rd1, [_Z6kernelPfPKf_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r1, 2139095040;
st.global.u32 [%rd2], %r1;
ret;
}

So I tried the analogous double-precision case,

__global__ void kernel (double *out, const double *in)
{
    *out = 1e308 * 1e308;  // product overflows 'double' range -> +Inf
}

and was blown away to find that the compiler emits code computing infinity at run time as 1.0/0.0!

.visible .entry _Z6kernelPdPKd(
.param .u64 _Z6kernelPdPKd_param_0,
.param .u64 _Z6kernelPdPKd_param_1
)
{
.reg .s64 %rd<3>;
.reg .f64 %fd<3>;

ld.param.u64 %rd1, [_Z6kernelPdPKd_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.f64 %fd1, 0d0000000000000000;
rcp.rn.f64 %fd2, %fd1;
st.global.f64 [%rd2], %fd2;
ret;
}

The resulting SASS also contains this computation.

My current best attempt at a portable, compact idiom for infinity is ((float)1e308*(float)1e308), but I have not yet fully tested it across all platforms.
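
For concreteness, here is the sort of test I have been running (a minimal sketch; the macro name INFTY and the scaffolding are mine, not part of any library):

#include <cstdio>

// candidate idiom: (float)1e308 already overflows to +Inf on the platforms
// tried so far, so the product is Inf * Inf = Inf
#define INFTY ((float)1e308 * (float)1e308)

__global__ void write_inf (float *out)
{
    *out = INFTY;   // device-side use
}

int main (void)
{
    float h = INFTY, *d = 0;   // host-side use
    printf ("host:   %g\n", h);
    cudaMalloc ((void **)&d, sizeof *d);
    write_inf<<<1,1>>>(d);
    cudaMemcpy (&h, d, sizeof h, cudaMemcpyDeviceToHost);
    cudaFree (d);
    printf ("device: %g\n", h);
    return 0;
}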

Here’s the SASS followed by PTX for each kernel…

CUDA 7.5 / SM_52 / float:

code for sm_52
		Function : _Z6kernelPfPKf
	.headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"

        /*0008*/                   MOV R1, c[0x0][0x20];
        /*0010*/                   MOV R0, c[0x0][0x140];
        /*0018*/                   MOV32I R2, 0x7f800000;

        /*0028*/                   STG [R0], R2;
        /*0030*/                   EXIT;
        /*0038*/                   BRA 0x38;
		...............................
{
	.reg .b32 	%r<4>;

	ld.param.u32 	%r1, [_Z6kernelPfPKf_param_0];
	cvta.to.global.u32 	%r2, %r1;
	mov.u32 	%r3, 2139095040;
	st.global.u32 	[%r2], %r3;
	ret;
}

CUDA 7.5 / SM_52 / double:

code for sm_52
		Function : _Z6kernelPdPKd
	.headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"

        /*0008*/                   MOV R1, c[0x0][0x20];
        /*0010*/                   CAL 0x38;
        /*0018*/                   MOV R0, c[0x0][0x140];

        /*0028*/                   STG.64 [R0], R2;
        /*0030*/                   EXIT;
        /*0038*/                   DSETP.GTU.AND P0, PT, RZ, +INF , PT;

        /*0048*/              @!P0 MOV32I R2, 0x0;
        /*0050*/              @!P0 MOV32I R3, 0x7ff00000;
        /*0058*/               @P0 MOV32I R2, 0x0;

        /*0068*/         {     @P0 MOV32I R3, 0x80000;
        /*0070*/                   RET;        }
        /*0078*/                   BRA 0x78;
		...............................
{
	.reg .b32 	%r<3>;
	.reg .f64 	%fd<3>;

	ld.param.u32 	%r1, [_Z6kernelPdPKd_param_0];
	cvta.to.global.u32 	%r2, %r1;
	mov.f64 	%fd1, 0d0000000000000000;
	rcp.rn.f64 	%fd2, %fd1;
	st.global.f64 	[%r2], %fd2;
	ret;
}

Thanks for giving it a try with CUDA 7.5 RC; it looks like nothing has changed. My new idiom seems to work well with the host compilers I have tried so far. Unfortunately, my attempts to locally suppress the warnings about overflow in constant arithmetic using #pragma have failed so far. Not too surprising: #pragma is an evil hack that has a tendency to be buggy.

[Later:] Sometimes I overthink things. The following seems to work just fine in kernels templated to use either ‘float’ or ‘double’, in both host and device code, and without warnings:

#define MY_INFINITY (1e38f*1e38f)
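
For example, a minimal sketch of the kind of templated use I have in mind, with the MY_INFINITY definition from above (the kernel name fill_inf is mine):

template <typename T>
__global__ void fill_inf (T *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = MY_INFINITY;   // works for T = float and T = double
}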

Interesting, nice find! Note that ptxas on CUDA 7.5 has actually specialized (but not inlined) the division and optimized away most of its code. So it was almost able to recognize the “1.0/0.0” idiom.
I interpret the SASS code as:

if (isnan (0.0))  // DSETP.GTU: 0.0 greater than, or unordered with, +Inf
  return __longlong_as_double (0x0008000000000000ULL);  // 2^-1023
else
  return INFINITY;

It seems constant propagation is doing a good job, but it is struggling with the unordered comparison against infinity (the isnan idiom?). This looks like an example of what you call an “impedance mismatch” in the tool chain.
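
As an aside, one source-level shape that maps to exactly such a GTU comparison (my own illustration, not taken from the actual emulation sequence):

#include <cmath>   // for INFINITY

// unordered comparisons evaluate to true if either operand is NaN, so an
// isnan() test can be expressed as a single greater-than-or-unordered
// (GTU) comparison against +Inf:
__device__ bool isnan_idiom (double x)
{
    return !(x <= INFINITY);   // true exactly when x is NaN
}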

Any idea about where the 0x0008000000000000 (2^-1023) immediate comes from? Some expression in which NaN was replaced by 0?
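
For what it is worth, a quick standalone host-side check confirms the decode of that immediate:

#include <cmath>
#include <cstdio>
#include <cstring>
#include <cstdint>

int main (void)
{
    uint64_t bits = 0x0008000000000000ULL;  // sign 0, exponent field 0, bit 51 set
    double d;
    memcpy (&d, &bits, sizeof d);           // reinterpret the bit pattern as a double
    // subnormal: 2^51 * 2^(-52) * 2^(-1022) = 2^(-1023)
    printf ("decoded: %g (equals 2^-1023: %d)\n", d, d == ldexp (1.0, -1023));
    return 0;
}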

I sure noticed how aggressively CUDA 7.5 applies constant propagation to the double-precision reciprocal code. While divisions and reciprocals are represented as single instructions at the PTX level, these are really just macros that expand into emulation sequences, so the optimization stages of PTXAS can only optimize the resulting sequence of individual machine instructions; they have no concept of a double-precision reciprocal.

I was wondering about the 0x0008000000000000 literal constant as well and could not come up with an explanation after some thought. But since that value is “unreachable” it is an X (don’t care) operand and could be anything.

I was just flabbergasted that NVVM apparently recognizes the infinity resulting from the overflow in the multiply, but instead of transforming it into a simple constant, turns it into the computation 1.0 / 0.0. While the transformation is numerically valid, it just makes no sense to me. It would even have been better to retain the original multiply as coded.

I suspect that NVVM inherited this transformation from LLVM. I have had previous experiences of LLVM applying a transformation just because it can, not because it makes any sense or provides any advantage (in fact, making code slower on the majority of processors). Just a hypothesis, though.