pow(double, 2.0) has extremely low branch efficiency, but pow(double, 2) is fast.

I’m learning CUDA and getting familiar with the Visual Studio Nsight Performance Analysis tools. I implemented a very naive Sobel edge finder. The goal was to make it work, and then use the Performance Analysis tools to improve its performance.

I ran the Performance Analysis tool with all of the Source experiments selected (Instruction Count, Divergent Branch, and Memory Transfer). The Divergent Branch experiment showed several branches with efficiencies of 0, 0.6, 18, etc., all pointing to:

__MATH_FUNCTIONS_DBL_PTX3_DECL__ double pow(double a, double b)
{
    return __nv_pow(a,b);
}

in math_functions_dbl_ptx3.hpp.

In my kernel, I was using the Pythagorean theorem like so:

.
.
.
int newPixelx = 0;
int newPixely = 0;
// convolve the 3x3 neighborhood with the Sobel kernels Gx and Gy
for (unsigned char y = 0; y < 3; y++)
{
   for (unsigned char x = 0; x < 3; x++)
   {
      newPixelx += (Gx[x][y] * subImage[x][y]);
      newPixely += (Gy[x][y] * subImage[x][y]);
   }
}
// gradient magnitude via the Pythagorean theorem
double newPixel = sqrt( (pow( (double)newPixelx, 2.0 ) + pow( (double)newPixely, 2.0 )) );

I removed the ‘.0’ from the ‘2’s in the last line:

double newPixel = sqrt( (pow( (double)newPixelx, 2 ) + pow( (double)newPixely, 2 )) );

And now there’s 100% branch efficiency, with no divergence in math_functions_dbl_ptx3.hpp; those calls no longer show up in the Divergent Branch results. Kernel execution time (on a 1920x1200 image) also dropped from ~30 ms to ~1 ms, with the same grid and block size.

What could cause such major divergence when calling pow with ‘2.0’ vs. ‘2’?

System Specs:
Windows 7 Pro 64-bit
GTX 750 Ti
Nsight 4.6.0.15071
Driver 347.62

Unverified hypothesis: the compiler replaces the “integer” version with a plain multiplication, i.e. newPixelx*newPixelx (and likewise for newPixely).

2.0 likely triggers the double precision version of the pow function

2.0f would use the single-precision code path (assuming both arguments to pow are single precision)
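For instance, the all-float variant of the line in question would look something like this (just a sketch):

float newPixel = sqrtf( powf( (float)newPixelx, 2.0f ) + powf( (float)newPixely, 2.0f ) );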

I have no idea if specific optimizations for integer exponents exist (it would make sense though…)

On consumer GeForce cards (except specific Titan models) you’ll want to avoid all uses of double precision unless you really need the precision.

For comparison’s sake, I did replace the pow calls with plain multiplications, with the same timing results.
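That is, roughly:

double newPixel = sqrt( (double)(newPixelx * newPixelx + newPixely * newPixely) );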

Up front: For the specific computation in the above snippet of code, I would recommend using the hypot() function.
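For the snippet above, that would be something like:

double newPixel = hypot( (double)newPixelx, (double)newPixely );

hypot(x,y) computes sqrt(x*x + y*y) while protecting against overflow and underflow in the intermediate computation.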

That said: pow(double,double) and pow(double,int) are definitely two different code paths.

pow(double,double) is a generic function that delivers accurate results for any combination of arguments. This means it involves significant computation, some of it using extended precision, plus significant overhead for checking the many special cases called out by language standards. This maps to code in the CUDA math library. Like pow(double,double) in other math libraries, this is not exactly a speed demon, although it is pretty well optimized at this point.

A cheaper alternative to pow(double,double) in terms of computation time is the use of exp(y*log(x)) instead of pow(x,y), with the former running at about twice the speed of the latter with many math libraries, including CUDA’s. The downside of this substitution is that it handles many special cases incorrectly and can incur significant error when ‘y’ is large.
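As a sketch (the helper name is mine; per the caveats above, this is only appropriate for x > 0 and moderate y):

__device__ double pow_via_exp_log(double x, double y)
{
    // cheaper than pow(x, y), but mishandles special cases and loses accuracy for large y
    return exp(y * log(x));
}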

If I recall correctly, pow(double,int) actually maps to a macro or template in the host’s math.h header file which is included by CUDA. This means it is resolved into discrete code before it ever reaches the CUDA toolchain, and CUDA is at the mercy of the host’s implementation. pow(x,2) may or may not map to x*x. You could check the disassembled machine code with cuobjdump --dump-sass to find out. As a conservative approach, I would suggest writing x*x explicitly where best performance is desired. Note also that common methods employed to compute pow(double,int) could incur fairly large error if the second argument is large.
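For example, a trivial squaring helper (hypothetical, just to illustrate the explicit-multiplication approach):

__device__ double sq(double x) { return x * x; }

double newPixel = sqrt( sq( (double)newPixelx ) + sq( (double)newPixely ) );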