How to avoid divergence when using "?:" operator for short statements

Using nvvp, I have identified two lines in my kernel that gave me high divergence.

The two lines are actually pretty simple. In the first one, I want to find the location of the minimum value in a float3 vector. I used

minval=fminf(fminf(htime[0],htime[1]),htime[2]);
(*minloc)=(minval==htime[0]?0:(minval==htime[1]?1:2));

where htime is a float pointer pointing to 3 numbers in register space.

The “?:” operation in the second line caused divergence 90% of the time.

The second one is slightly more complicated: I call a custom math function (to replace nextafter) inside another “?:” operator:

(*minloc==0) ?
    (htime[0]=mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f))) :
    ((*minloc==1) ?
        (htime[1]=mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f))) :
        (htime[2]=mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f))) );

This again gave me 90% divergence.

In addition, the short function (23 lines total) that contains both of these cases is a hotspot of my code, taking about 1/10 of the run-time. PC sampling profiling attributes 64% of the function's latency to execution dependency and 23% to instruction fetch. I suspect those were also caused by the two ?: operators above.

My questions are:

  1. Is there a way to optimize the above code to avoid the divergence? I tried to use minloc as an index to avoid the second ?:, but that forces my htime array into local memory (instead of registers).

  2. Even if I can find a way to avoid divergence in the above cases, do you think it will likely have a major impact on execution efficiency? The expressions involved are fairly short.

Happy to hear what you think about this.

FYI, the function that I am interested in optimizing can be found here

Are you building with full optimizations? Have you checked the SASS? I am a bit surprised that you don’t get branchless code for the first idiom. It may have to do with the compiler having to assume possible aliasing for the array references, in particular htime.

So before you do anything else, I would highly recommend you use ‘const’ and ‘restrict’ with all pointer arguments as appropriate. For example:

hitgrid(const float3 * __restrict__ p0, const float3 * __restrict__ v, float * __restrict__ htime, const float * __restrict__ rv, int * __restrict__ id){

If that doesn’t improve the generated code, I’d suggest trying this: copy htime to three local variables, do all computations on those, and at the end copy the final results back to htime.
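
As a rough sketch of that restructuring (hx/hy/hz are just placeholder names for the local copies):

float hx = htime[0], hy = htime[1], hz = htime[2]; // local copies should stay in registers

/* ... do all intermediate computations on hx/hy/hz only ... */

htime[0] = hx;  // write the final results back once, at the end
htime[1] = hy;
htime[2] = hz;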

I am surprised about that too. As shown in the attached screenshot in the first post, nvvp reported two divergences inside this function (line #687); these are the only two places that could cause divergence, unless I misunderstood nvvp’s report.

Thanks for your quick reply. I tried both of your suggestions; unfortunately, they didn’t seem to make a difference, either in the overall simulation speed or in the profiler output. nvvp still reported two 90% divergences inside this function.

That does not sound right. I wonder whether we are looking at the correct code here. Could you show the machine code generated for this function for your release build? (I tried to build it myself, but it seems there are too many dependencies on other code to compile this function quickly.) You can get the machine code from cuobjdump --dump-sass.

[Later:] Never mind, I got the code to compile with a bit of scaffolding. After cleaning up mcx_nextafterf() to this:

__device__ inline float mcx_nextafterf(float a, int dir){
   float num = a + gcfg->maxvoidstep;
   num = __int_as_float (dir ^ (__float_as_int (num) & 0x80000000U));
   return num - gcfg->maxvoidstep;
}

it is down to three branches, which appear to correspond to this code:

(*id==0) ?
    (htime[0]=mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f))) :
    ((*id==1) ?
        (htime[1]=mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f))) :
        (htime[2]=mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f))) );

That code appears to be too voluminous to be predicated. This can be fixed by pulling the computation out of the conditional clauses into unconditional code:

t0 = mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f));
t1 = mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f));
t2 = mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f));
    
if (*id == 0) htime[0] = t0;
if (*id == 1) htime[1] = t1;
if (*id == 2) htime[2] = t2;

Compiled for sm_50, the branches are gone. Here is the revised code as cut & paste from my scaffolding:

__device__ inline float mcx_nextafterf(float a, int dir)
{
    float num = a + gcfg->maxvoidstep;
    num = __int_as_float (dir ^ (__float_as_int (num) & 0x80000000U));
    return num - gcfg->maxvoidstep;
}

__device__ inline float hitgrid (const float3 * __restrict__ p0, 
                                 const float3 * __restrict__ v, 
                                 float *  __restrict__ htime, 
                                 const float *  __restrict__ rv, 
                                 int * __restrict__ id)
{
    float dist, t0, t1, t2;
    
    //time-of-flight to hit the wall in each direction
    htime[0]=fabs((floorf(p0->x)+(v->x>0.f)-p0->x)*rv[0]); // absolute distance of travel in x/y/z
    htime[1]=fabs((floorf(p0->y)+(v->y>0.f)-p0->y)*rv[1]);
    htime[2]=fabs((floorf(p0->z)+(v->z>0.f)-p0->z)*rv[2]);
    
    //get the direction with the smallest time-of-flight
    dist=fminf(fminf(htime[0],htime[1]),htime[2]);
    (*id)=(dist==htime[0]?0:(dist==htime[1]?1:2));
    
    // p0 is inside, p is outside, move to the 1st intersection pt, now in 
    // the air side, to be corrected in the else block
    htime[0]=p0->x+dist*v->x;
    htime[1]=p0->y+dist*v->y;
    htime[2]=p0->z+dist*v->z;
    
    t0 = mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f));
    t1 = mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f));
    t2 = mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f));
    
    if (*id == 0) htime[0] = t0;
    if (*id == 1) htime[1] = t1;
    if (*id == 2) htime[2] = t2;
    
    return dist;
}

A word of general advice. In my experience it is often better from a performance perspective to rely on the compiler’s if-conversion rather than trying to preempt the compiler optimization by using the ternary operator in the source code. This often results in clearer code as well.

A particularly suitable pattern is to compute unconditionally and then make a late selection via if-statement(s), as I have done here. If one path is exceptional (a low-probability slow path) a good idiom is to compute the common path (high-probability fast path) unconditionally, then conditionally invoke the slow path at the end, letting it overwrite the fast-path result.
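
As a generic illustration of that second idiom (the function and the "exceptional" condition below are made up for illustration, not taken from your code):

__device__ float example_fast_slow(float x)
{
    // fast path: compute the common case unconditionally
    float r = 0.5f * x;
    // slow path: conditionally overwrite the result for the rare case;
    // with a body this short, the compiler can usually if-convert it
    if (!isfinite(x))
        r = 0.0f;
    return r;
}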

Can you do a reduction over the float3 as an iterable and get the min or max from that? I think there are unconditional min/max operations that you can use.

Thank you so much! I tested your suggested changes, and the divergence inside this function is indeed gone!

Only one issue: with your modified mcx_nextafterf() function, I was not able to get correct simulation results - the total absorption fraction printed at the end of the simulation became 20.3% instead of the expected 17.7%. I reverted the changes to mcx_nextafterf() back to my original version, and the output is now correct. There seems to be no speed penalty from using my original version.

Here are my committed changes:

With this update, the simulation speed improved by 6% on my Pascal card. So, indeed, this helps.

I suspect the modified mcx_nextafterf() may have an issue when dealing with the sign bit. In my original version, I used an “unsigned int” to map the float; in your version, you used __float_as_int and __int_as_float, both of which seem to handle int instead of uint. I changed the line to

num = __int_as_float ((uint)(dir ^ ((uint)__float_as_int (num) & 0x80000000U)));

but still I was not able to get the same absorption fraction.


Some additional observations:

If I change the 3 if-statements at the end of hitgrid into an indexed array access, like this

htime[*id] = xi[*id];

the speed dropped by over 30%. I am sure this is because nvcc moved htime/xi into local memory instead of registers.

If I change the 3 if-statements into another ?: ternary operator, like

(*id == 0)? (htime[0] = xi[0]) : ((*id == 1)? (htime[1] = xi[1]):(htime[2] = xi[2]));

I did not get divergence, but the speed took a 1-2% hit.

Thanks for your comment. From njuffa’s analysis above, it appears that the divergence only happens at the second ?: operator, because the mcx_nextafterf call is too big to be predicated.

However, I am interested in learning your trick regarding min/max of short vectors, if you don’t mind sharing.

This exercise made me realize several misconceptions of mine:

  1. I had always thought that the ternary operator is recommended over short if/else blocks. My impression was that when ?: is used, all branches are evaluated, so divergence is avoided. I remember reading this in older versions of the CUDA Best Practices Guide. Perhaps this only applies to short statements?

  2. Because my program is register-bound (it now uses 64 or more registers), I wanted to minimize the use of registers. In your change, you added 3 local variables - t0, t1, t2. I thought this would further restrict how many blocks can execute on an SM. But from nvvp’s report, it appears that the total number of registers needed remains the same, at 64.

What exactly happens when you declare additional variables inside an inline function? Will it further increase register pressure, or can the GPU reuse other inactive registers to hold these additional variables within the scope of the inline function?

  3. Can you elaborate on the mentioned “if-conversion”? Under what conditions does this conversion happen? Can I use if()/else if()/else? Does it apply to slightly more complex branches, such as
if(*id==0)  htime[0]=mcx_nextafterf(__float2int_rn(htime[0]), (v->x > 0.f)-(v->x < 0.f));
if(*id==1)  htime[1]=mcx_nextafterf(__float2int_rn(htime[1]), (v->y > 0.f)-(v->y < 0.f));
if(*id==2)  htime[2]=mcx_nextafterf(__float2int_rn(htime[2]), (v->z > 0.f)-(v->z < 0.f));

and get rid of t0/t1/t2?

Thanks again.

Sorry, I goofed when I modified mcx_nextafterf(): I mistakenly read a “+=” as “=”. This version should work:

__device__ inline float mcx_nextafterf(float a, int dir)
{
    float num = a + gcfg->maxvoidstep;
    uint inum = __float_as_int (num);
    num = __int_as_float (inum + (dir ^ (inum & 0x80000000U)));
    return num - gcfg->maxvoidstep;
}

My main point is to avoid archaic union-based type punning techniques where CUDA offers dedicated re-interpretation intrinsics. In my tests, this change away from the union removed two branches from the generated code.
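
For reference, the union-based variant I replaced looked roughly like this (reconstructed from memory with a made-up name, so possibly not the exact original):

__device__ inline float mcx_nextafterf_union(float a, int dir)
{
    union {
        float f;
        unsigned int i;
    } num;
    num.f = a + gcfg->maxvoidstep;
    num.i += dir ^ (num.i & 0x80000000U);  // step the bit pattern based on direction and sign bit
    return num.f - gcfg->maxvoidstep;
}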

If-conversion is the process whereby the compiler converts if-statements into either a sequence of predicated instructions, or select-type operations (basically, the hardware equivalent of the ternary operator), or a combination thereof. Writing code in a natural style with if-statements usually gives the compiler a better opportunity to convert conditional code into efficient branchless machine code.
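
As a trivial made-up example, an if-statement with a short body like the one below is typically if-converted into a select or predicated instruction rather than a branch:

__device__ float clamp_to_zero(float x)
{
    float y = x;
    if (x < 0.0f)   // short conditional assignment: usually becomes a SEL/predicated move, no branch
        y = 0.0f;
    return y;
}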

If you make the change indicated in your number (3), you will likely find that the resulting code remains branchless (at least it does for me on CUDA 7.5 for sm_50), but that it requires more instructions than the idiom I proposed. From a quick look at the SASS, it seems the difference is that the address computations for the accesses to htime are now conditional and the code for them is no longer shared.

The compiler uses heuristics to determine whether it is better to use a branch or a branchless idiom, and these (architecture-dependent) heuristics are quite appropriate from what I can tell. Since the heuristics are not publicly documented (and subject to change), you cannot predict with certainty when the compiler will use a branch instead of predication, but clearly there is a cutoff when the predicated sequence gets too long. Inspect the SASS to see whether branches remain. Note: depending on the probability of divergence (which the compiler cannot know, since it is a run-time effect), branchy code can actually be better than branchless code; it really depends on the individual case.

There is no general relationship between the number of declared variables in the source code and register usage. The compiler typically needs to create many temporary variables, for example for the address computations in this code. By adjusting code scheduling and variable live ranges it can tune the register usage, and the heuristics used for that work well for 90+% of code. In general, with modern GPU architectures and recent (past two or three years) compilers, programmers do not have to worry about register pressure very much; where it turns out to be necessary, I would recommend using __launch_bounds__() to control it.
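
For example, __launch_bounds__() is attached to the kernel definition like this (the values 256 and 4 are just placeholders; pick them to match your desired occupancy):

// placeholder limits: up to 256 threads per block, at least 4 resident blocks per SM;
// the compiler will cap per-thread register use to make that occupancy achievable
__global__ void __launch_bounds__(256, 4)
mykernel(float * __restrict__ out, const float * __restrict__ in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}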