Hi,
I have been trying to understand the separable convolution example (the one located in the OpenCL/src/oclConvolutionSeparable of the SDK),
and I am puzzled.
Let’s look into the row convolution filter:
In oclConvolutionSeparable_launcher.cpp, we can see that the local work size will be:
ROWS_BLOCKDIM_X * ROWS_BLOCKDIM_Y
and the global work size will be:
(imageW / ROWS_RESULT_STEPS ) * imageH;
That means that we will have in total
ROWS_BLOCKDIM_X * ROWS_BLOCKDIM_Y * (imageW / ROWS_RESULT_STEPS) * imageH
work-items (threads) to compute imageW * imageH output samples.
With the values given (ROWS_BLOCKDIM_X = 16, ROWS_BLOCKDIM_Y = 4, ROWS_RESULT_STEPS = 8), that means 16 * 4 * imageW/8 * imageH = 8 times more work-items than samples to compute.
So with that, I would expect each thread to compute partial results, then combine the results in some way.
However, in the kernel (in ConvolutionSeparable.cl) I cannot see anything like that.
What I see instead is that every work-item writes multiple results:
for(int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++){
    float sum = 0;

    for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
        sum += c_Kernel[KERNEL_RADIUS - j] * l_Data[get_local_id(1)][get_local_id(0) + i * ROWS_BLOCKDIM_X + j];

    d_Dst[i * ROWS_BLOCKDIM_X] = sum;
}
Again, with the values given in oclConvolutionSeparable_launcher.cpp,
every work-item will write out ROWS_RESULT_STEPS = 8 values.
So to summarise: I count 8 times more work-items launched than output samples to compute, and each work-item writing 8 output values! I must be missing something.
Does anyone understand this convolution sample? Could you please try to explain how it works?
Thanks a lot in advance.
Regards,
Babel.