NVIDIA Visual Profiler fails to detect divergent branch

Hi everyone,

I'm currently testing the Visual Profiler with the sample projects that ship with the SDK. My graphics card is a Quadro FX 1800, which has compute capability 1.1.
The MatVecMul project implements several kernels that compute the product of a matrix and a vector. Let's focus on the first kernel: MatVecMulUncoalesced0.

[code]// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulUncoalesced0(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Row index
    uint y = get_global_id(0);
    if (y < height) {

        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (int x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}[/code]

There is one very simple branch in this code: if the thread's global ID (y) is greater than or equal to the height of the matrix, the thread does nothing. Otherwise it computes the dot product of the y-th row of the matrix and the vector.
The work group size is 256, and the host code creates as many work groups as needed for the global thread count to reach or exceed the height of the matrix.
In my opinion, there should be at least one divergent branch (unless the height of the matrix is a multiple of 32). It should occur in the warp containing the thread whose global ID equals "height".

However, my matrix height is not a multiple of 32, and the profiler reports no divergent branch.

Is this a bug in the profiler, or am I misunderstanding something?

Thank you all
