__device__ functions create divergent branches
Hello,

my kernel code makes use of a __device__ function to calculate Gaussian coefficients. Profiling that kernel, I can see that it produces an instruction replay overhead of about 20% due to divergent branches. After removing the __device__ function (by manually inlining it), the kernel does [b]not[/b] have any divergent branches. Why is that?

How are __device__ functions actually called? Why could that lead to any branches?

Regards,
Azr@el

#1
Posted 01/27/2012 11:36 AM   
It would help if we could see the code for the device function. Two potential reasons (off the top of my head, there may be others):

(1) the device function contains branches (if-then-else, loops, etc.)
(2) the device function calls a math function that contains branches

[later:]

Hm, I see that you state that manually inlining the function removed the divergence. That would suggest that there may be a branch upstream that cannot be removed as long as there is a function call, but after inlining the compiler is able to if-convert that branch, e.g. by using predication. Without having compilable code one can only speculate. You could disassemble the binary with cuobjdump to see where branches are being used in the generated code.

What happens if you use __forceinline__ on the device function, instead of manually inlining it?
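
Applying the qualifier would look something like this (a generic sketch with a made-up function, since I have not seen your code):

[code]
// hypothetical device function, for illustration only
__device__ __forceinline__ float my_func(float x)
{
    return expf(-x * x);
}
[/code]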

#2
Posted 01/27/2012 09:58 PM   
I apologize for my late reply.

The device function is incredibly simple:

[code]
__device__ float __getGaussCoeffs( int x, float deviation )
{
    return expf(-0.5f * x * x / deviation / deviation);
}
[/code]

The kernel code contains a for loop with a constant number of iterations, and an if statement whose condition is [code]y >= 0 && y < height && x - kernel_radius >= 0 && x + kernel_radius < width[/code], where x and y are the coordinates of the thread relative to the grid.

I've not been able to profile the kernel with __forceinline__ yet, but when simply running the code, __forceinline__ does not affect the runtime.

Looking at the disassembly of my PTX code, how can I detect a divergent branch?
Does expf result in a branch? If so, can I take it as a rule of thumb to avoid expf (or branching intrinsics in general) within __device__ functions?

Regards,
Azr@el

PS: In case it is of interest, I'm using CUDA 4.1 on cards with compute capability 1.0, 1.3, and 2.0

#3
Posted 01/30/2012 02:18 PM   
Do you see the same behavior with CUDA 4.0?

GPU Developer at AccelerEyes ([email="support@accelereyes.com"]Email me[/email])

#4
Posted 01/30/2012 03:54 PM   
Yes, I did.

The findings from the original post were based on 4.0.

#5
Posted 01/30/2012 05:11 PM   
Performance side note. I would suggest replacing the dual division

[code]
return expf(-0.5f * x * x / deviation / deviation);
[/code]
with a division by the square of "deviation", unless the square overflows the single-precision range:

[code]
return expf(-0.5f * x * x / (deviation * deviation));
[/code]

The compiler cannot make that transformation automatically as it is not an identity transformation (e.g. consider x = 1e19, deviation = 1e25: deviation * deviation overflows to infinity in single precision, so the quotient becomes zero instead of a tiny nonzero value).
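
Going one step further (just a sketch, and not numerically identical to a true division): since the deviations are compile-time constants in your case, you could multiply by the reciprocal instead of dividing, which also avoids the sm_2x division subroutine mentioned further down:

[code]
// sketch: multiply by a reciprocal instead of dividing; after inlining with a
// constant 'deviation', 1.0f / (deviation * deviation) is a constant expression
// that the compiler should fold, so no division remains at run time
__device__ float __getGaussCoeffs( int x, float deviation )
{
    return expf(-0.5f * x * x * (1.0f / (deviation * deviation)));
}
[/code]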

Without seeing the complete code, and without knowing how that code is being compiled, I can only speculate where possibly divergent branches exist in the code. In particular, one would have to study the machine code by looking at the disassembly produced by cuobjdump. A non-uniform branch can diverge; whether it will in fact diverge depends on the data encountered by the branch. In the code snippet shown, the comparisons that involve kernel_radius look like data-dependent branches that are potentially divergent.
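
For example, something along these lines (with a placeholder file name):

[code]
cuobjdump -sass my_app     # dump the SASS machine code of the embedded kernels
[/code]

In the disassembly, look for BRA instructions; predicated instructions show up with an @P0 / @!P0 prefix instead of a branch.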

If I remember correctly, expf() is a branchless code sequence at the machine code level across all architectures. For sm_2x and higher, the default (IEEE-754 compliant) single-precision division is a called subroutine that contains a branch to separate out special cases. That data-dependent branch will be divergent if both regular and special cases are encountered within the same warp. Special cases involve operands at the very ends of the exponent range, i.e. close to underflow and close to overflow, zero, infinity, NaN.

With the SIMT architecture of the GPU, possible divergence is a fact of life, and the compiler typically handles this appropriately via a combination of predication, select-type instructions, uniform branches, and regular branches. Trying to avoid divergence does not necessarily improve performance. For example, avoiding a divergent if-then-else by computing both paths and selecting the result at the end with a ternary operator ensures that all threads incur the cost of executing both sides, so no performance is gained.
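
To illustrate with a contrived sketch (not taken from your code):

[code]
// contrived example, for illustration only
__device__ float f_branchy(float x)
{
    float r;
    if (x > 0.0f) {          // data-dependent branch; may diverge within a warp
        r = expf(x);
    } else {
        r = 2.0f * x;
    }
    return r;
}

__device__ float f_selected(float x)
{
    // "branchless" formulation: typically if-converted to predication/select,
    // but every thread then pays for evaluating both expressions
    return (x > 0.0f) ? expf(x) : 2.0f * x;
}
[/code]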

#6
Posted 01/30/2012 06:59 PM   
Well, here is my complete kernel code:

[code]
__global__ void filter(const int width, const int height, unsigned char * image,
                       const size_t image_pitch, unsigned char * result, size_t result_pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    float l_sum1 = 0.0f,
          l_sum2 = 0.0f,
          l_f_blur;

    if (y >= 0 && y < height && x - KERNEL_RADIUS >= 0 && x + KERNEL_RADIUS < width)
    {
        for (int dx = x - KERNEL_RADIUS; dx <= x + KERNEL_RADIUS; dx++)
        {
            l_f_blur = __getGaussCoeffs(x - dx, SPATIAL_DEVIATION)
                     * __getGaussCoeffs(image[INDEX(x,y,image_pitch)] - image[INDEX(dx,y,image_pitch)], INTENSITY_DEVIATION);
            l_sum1 += image[INDEX(dx,y,image_pitch)] * l_f_blur;
            l_sum2 += l_f_blur;
        }
        result[ INDEX(x,y,result_pitch) ] = ((int)fabs( l_sum1 / l_sum2));
    }
}
[/code]

SPATIAL_DEVIATION, INTENSITY_DEVIATION and KERNEL_RADIUS are macros that contain compile-time constant values; INDEX is a macro that maps a block-relative thread index to a grid-relative index.

nvcc arguments are -gencode arch=compute_20,code=\"sm_20,compute_20\" --ptxas-options=-v

#7
Posted 01/31/2012 11:01 AM   