Understanding the NVIDIA separable convolution example

Hi,

I have been trying to understand the separable convolution example (the one in OpenCL/src/oclConvolutionSeparable of the SDK), and I am puzzled.

Let's look into the row convolution filter:
In oclConvolutionSeparable_launcher.cpp, we can see that the local work size will be:
ROWS_BLOCKDIM_X * ROWS_BLOCKDIM_Y
and the global work size will be:
(imageW / ROWS_RESULT_STEPS) * imageH

That means we will have, in total,
ROWS_BLOCKDIM_X * ROWS_BLOCKDIM_Y * (imageW / ROWS_RESULT_STEPS) * imageH
threads to compute imageW * imageH output samples.
With the values given, that is 16 * 4 * (imageW / 8) * imageH, i.e. 8 times more threads than samples to compute.
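To make the arithmetic concrete, here is a quick sanity check of that reading (the constants are what I believe to be the SDK defaults, and the image size is just an assumed example, not necessarily what the sample uses):

```python
# Work-size arithmetic as I read it from oclConvolutionSeparable_launcher.cpp.
# Constant values assumed to be the SDK defaults; image size is an example.
ROWS_BLOCKDIM_X = 16
ROWS_BLOCKDIM_Y = 4
ROWS_RESULT_STEPS = 8

imageW, imageH = 3072, 3072  # assumed example size

local_size = ROWS_BLOCKDIM_X * ROWS_BLOCKDIM_Y        # 64 work-items per group
global_size = (imageW // ROWS_RESULT_STEPS) * imageH  # as set in the launcher

total_threads = local_size * global_size              # my reading of the launch
output_samples = imageW * imageH

print(total_threads // output_samples)  # prints 8
```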
Given that, I would expect each thread to compute partial results that are then combined in some way.
However, in the kernel (in ConvolutionSeparable.cl) I cannot see anything like that.
On the contrary, every work-item writes multiple results:
[code]
for(int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++){
    float sum = 0;

    for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
        sum += c_Kernel[KERNEL_RADIUS - j] * l_Data[get_local_id(1)][get_local_id(0) + i * ROWS_BLOCKDIM_X + j];

    d_Dst[i * ROWS_BLOCKDIM_X] = sum;
}
[/code]
Again, with the values given in oclConvolutionSeparable_launcher.cpp,
every work-item will write out 8 values.

So, to summarise: I count 8 times more work-items launched than output samples to compute, and each work-item writes 8 output values! I must be missing something.
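For concreteness, tracing the destination offsets of that loop by hand (assuming the SDK defaults ROWS_HALO_STEPS = 1, ROWS_RESULT_STEPS = 8, ROWS_BLOCKDIM_X = 16), this is the write pattern I get for a single work-item:

```python
# Destination offsets written by one work-item in the row-convolution loop,
# relative to its base d_Dst pointer. Constant values are assumed SDK defaults.
ROWS_HALO_STEPS = 1
ROWS_RESULT_STEPS = 8
ROWS_BLOCKDIM_X = 16

offsets = [i * ROWS_BLOCKDIM_X
           for i in range(ROWS_HALO_STEPS, ROWS_HALO_STEPS + ROWS_RESULT_STEPS)]

print(offsets)       # [16, 32, 48, 64, 80, 96, 112, 128]
print(len(offsets))  # 8 writes per work-item
```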

Does anyone understand this convolution sample? Could you please try to explain how it works?
Thanks a lot in advance.

Regards,

Babel.
