OpenCL-OpenGL interop 4-5x slower than Cg shaders: performance issues with OpenCL

Hi,
I have never used CUDA before, but in the past, to perform general-purpose GPU calculations, I used vertex and fragment shaders written in NVIDIA's Cg language.
Recently I switched to OpenCL, mainly to port my calculations to non-NVIDIA hardware. Because my main interest lies in image processing, I made use of the OpenCL-OpenGL interop, as this is claimed to be the most efficient way to display the processed images on the screen.
The 2D image memory objects in OpenCL are nicely created from OpenGL textures and everything seems to be working just fine. But when I look at the frame rates at which my OpenCL kernels run, they seem to be roughly 4 to 5 times slower than their Cg fragment shader counterparts. Even for a simple kernel that just inverts the color values of a gray-scale image, the OpenCL kernel operates at only 330 FPS, while the corresponding Cg fragment shader does the same operation at roughly 1400 FPS.
Because OpenCL runs through CUDA on NVIDIA GPUs, I believe the same results can be expected if one compares CUDA to Cg shaders. So at the moment these performance issues don't convince me to rewrite all my code in OpenCL or CUDA!
Can anybody help me?

I've put the OpenCL kernel code and the Cg fragment shader below.
In the Cg case the output texture is attached to an OpenGL Frame Buffer Object (FBO) and I write the output through the render-to-texture mechanism.
The input and output textures are initialized in exactly the same way in both cases:
input: glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 2048, 1920, 0, GL_LUMINANCE, GL_UNSIGNED_BYTE, image_data);
output: glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 2048, 1920, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
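The corresponding OpenCL images are then created from these textures, roughly along the following lines (a simplified sketch; error handling omitted and the variable names are just placeholders for my actual members):

cl_int errorcode;
//the input texture is only read by OpenCL
cl_mem inputImage = clCreateFromGLTexture2D(clContext, CL_MEM_READ_ONLY, GL_TEXTURE_2D, 0, inputTexID, &errorcode);
//the output texture is only written by OpenCL
cl_mem outputImage = clCreateFromGLTexture2D(clContext, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, outputTexID, &errorcode);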

OPENCL:

__kernel void InvertGrayImage(__read_only image2d_t input, __write_only image2d_t output)
{
const sampler_t usedsampler = CLK_FILTER_NEAREST|CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_NONE;
int2 Coord = (int2)(get_global_id(0), get_global_id(1));
float4 value = (float4)(1.f) - read_imagef(input, usedsampler, Coord);
write_imagef(output, Coord, value);
}

Cg fragment shader

void InvertGrayImage(in float2 TexCoord : TEXCOORD0, out float4 color : COLOR, uniform sampler2D Input)
{
float4 value = float4(1.f) - tex2D(Input, TexCoord);
color = value;
}

Hi!

First of all, the assumption that CUDA-OpenGL interop will perform the same as OpenCL-OpenGL interop is quite bold. CUDA in most cases is faster than OpenCL (I won't get started on why that can be). Second, if something works with Cg alone (such as image processing), I see no point in trying to make something else work, as Cg is vendor-independent. I have coded Cg on my ATI card without problems; Cg generates GLSL shaders which are then passed to the driver. So if something works fine with Cg alone, and it is fast, I really see no point in involving anything else.

If you really want to use OpenCL to make something cross-vendor, then the two shaders you gave are completely identical, and there's no reason they should perform differently. The big difference is in the host code, where extra synchronization is required to make things work. You have seen that the shader is so simple that in Cg it runs at 1400 FPS. With interop, you force the runtime to do its checks 1400 times per second. There are interop applications that run this fast if the host code is well structured. If you could post the main loop of the two cases, that might reveal what the problem is (it should run faster than 330 FPS).

But I still uphold my previous statement: Cg works fine on ATI (for the most part, I have not had any problems with it). If it's not something really tricky you want to do, I'd first try how efficient it is with Cg alone on ATI.

Hey,
thanks for your response.
To avoid any confusion, I didn't mean that CUDA will be slower than OpenCL. If you look at the performance tests and examples around the web, CUDA in most cases comes out as the faster of the two. But in most examples I found, OpenCL still reaches at least 80% of the CUDA performance and is not 4-5 times slower. So by logical deduction, writing the same kernel in CUDA would also give me the 4-5 times slower performance compared to the Cg fragment code. That is, if the two example kernels really are the same and if everything I initialized and coded on the host side is optimal and correct. And the latter, I'm sure, is not the case, as I'm not an expert at all when it comes to parallel programming and squeezing out performance.
I did try to run the Cg shaders on several AMD/ATI cards, but I ran into a lot of issues concerning the use of for loops, control statements, profiles that were not supported, and so on. Because I apparently didn't have the skills to solve these issues, I tried to use GLSL directly instead, because, as you mentioned, GLSL shaders are generated from the Cg shaders when used on ATI cards. But for more elaborate algorithms, the GLSL versions didn't perform as well as the original Cg shaders.
But then again, there are probably some basic considerations or maybe small details that I'm overlooking in my OpenCL code (and back then, in my GLSL code), so I have put some host code below as you suggested (I left out the error handling). I hope this is what you meant.

The member variable mOCLObject is an instance of a class that builds up all the necessary OpenCL initialization: platforms, devices, contexts, command queues, programs, and so on. The variables Images[0] and Images[1] were successfully created with clCreateFromGLTexture2D(…). Maybe it's also worthwhile to mention that I tried all possible combinations for the local size of the work groups. The maximum I could use on my NVIDIA card was 512 work items in total, so I tried (512, 1), (256, 2), … (1, 512) and smaller sizes.

OPENCL
cl_kernel kernel;
cl_int errorcode;
kernel = clCreateKernel(*mOCLObject.mProgram, "InvertGrayImage", &errorcode);

//Set the arguments of the kernel function, which are just the input and output OpenCL images
errorcode = clSetKernelArg(kernel, 0, sizeof(cl_mem), &Images[0]);
errorcode = clSetKernelArg(kernel, 1, sizeof(cl_mem), &Images[1]);

size_t offset[2] = {0, 0};
size_t global_size[2] = {2048, 1920};
size_t local_size[2] = {16, 16};
cl_event kernel_event;

//Finish all pending OpenGL commands
glFinish();
//Give OpenCL exclusive access to the shared textures
errorcode = clEnqueueAcquireGLObjects(mOCLObject.mComQueues[0], NumOfTextures, Images, 0, NULL, NULL);

//Execute the "InvertGrayImage" OpenCL kernel
errorcode = clEnqueueNDRangeKernel(mOCLObject.mComQueues[0], kernel, 2, offset, global_size, local_size, 0, NULL, &kernel_event);

//Release the exclusive OpenCL access
errorcode = clEnqueueReleaseGLObjects(mOCLObject.mComQueues[0], 2, Images, 0, NULL, NULL);

//Finish all OpenCL commands, because we need to use OpenGL again to draw the results
clFinish(mOCLObject.mComQueues[0]);

//Profile the kernel timings to calculate the frames per second
cl_ulong start, end;
clWaitForEvents(1, &kernel_event);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
//end and start are expressed in ns, so we have to multiply by 1.0e-9 to get seconds
float FramesPerSec = 1.f / ((end - start) * 1.0e-9);
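
As a side note, the work-group limits I mentioned above can be queried like this (just a generic sketch, not my exact code; device and kernel are placeholders for the cl_device_id and cl_kernel objects):

size_t maxDeviceWG; //maximum total work-group size supported by the device
size_t maxKernelWG; //maximum total work-group size for this particular kernel
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &maxDeviceWG, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &maxKernelWG, NULL);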

Below you can see some Cg host code. mCGob is an object of a class that successfully performed all the initializations and compilations of the vertex and fragment programs.
This class was also involved in determining the optimal vertex and fragment profiles.
Cg host program:

glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0.0, 1.0, 0.0, 1.0, 0.01, 1000.0);

//activate the FBO where we have to render the output to.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, mFBO[19]);
//The output texture is coupled to color attachment zero of the frame buffer object 19
glDrawBuffer(GL_COLOR_ATTACHMENT0_EXT);

//Set the fragment profile for this shader, which is the 37th shader of an array of fragment shaders!
cgGLEnableProfile(mCGob->mFragmentProfile);

//Set the input texture as argument to the shader. mTexID[0] contains the input image
cgGLSetTextureParameter(mCGob->mFragments[37].mParameters[0], mTexID[0]);
cgGLEnableTextureParameter(mCGob->mFragments[37].mParameters[0]);

//update the parameters of fragment shader 37
cgUpdateProgramParameters(mCGob->mFragments[37].mProg);

//Load and bind the shader to the following OpenGL glBegin-glEnd instruction set
cgGLLoadProgram(mCGob->mFragments[37].mProg);
cgGLBindProgram(mCGob->mFragments[37].mProg);

//Set the viewport to match the resolution of the image
glViewport(0, 0, 2048, 1920);

glEnable(GL_TEXTURE_2D);
glBegin(GL_QUADS);
glTexCoord2f(0.0, 1.0);
glVertex2f(0.0, 1.0);
glTexCoord2f(1.0, 1.0);
glVertex2f(1.0, 1.0);
glTexCoord2f(1.0, 0.0);
glVertex2f(1.0, 0.0);
glTexCoord2f(0.0, 0.0);
glVertex2f(0.0, 0.0);
glEnd();
glDisable(GL_TEXTURE_2D);

//Disable the texture parameters
cgGLDisableTextureParameter(mCGob->mFragments[37].mParameters[0]);
//Disable the fragment profile.
cgGLDisableProfile(mCGob->mFragmentProfile);

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0); //Activate the regular window frame buffer again to display results on the screen.

Are these the display loops in your program? If yes, then there are some things to point out, but if not, I don't want to say stupid things. I would also like to see some of the init code (it is good to know how things are initialized), and if it's no government secret and the code is reasonably minimal, you could even attach it to a post.

Hey Meteorhead,

these are indeed pretty much the display loops of my program. Every time a camera in my system captures a frame, the OpenCL kernel is executed. The only thing I didn't include here are the actual commands that draw the result of the kernel on the screen. These call the OnDraw of a CWnd-derived class, which just draws the output texture over a unit quad with plain OpenGL instructions.

I have attached the header and .cpp files of two self-made classes I use to initialize the necessary OpenCL objects.

The TOpenCLObject class will build up pretty much everything you need for OpenCL. The class TKernels contains the actual kernel code, which is written in its constructor.

This TKernel class is used in the TOpenCLObject class as input for clCreateProgramWithSource(…). The resulting cl_program is then passed to clBuildProgram(…) to compile my OpenCL code. There isn't much kernel code in there yet (except, obviously, the kernel that raised this performance question in the first place), as I've only just started with OpenCL. This TKernel class should, however, contain all my kernel code in the near future.
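
In essence that part boils down to something like the following (a simplified sketch; the real calls live inside TOpenCLObject, and GetSource() is a hypothetical accessor used here only for illustration):

const char* source = TheCode.GetSource(); //hypothetical accessor returning the kernel source string
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS)
{
    //on a build failure the compiler log can be queried like this
    char log[4096];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
}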

I can then use these two classes in, for instance, the OnInitDialog() function of a dialog-based application to initialize the OpenCL objects once and for all; I can retrieve them anywhere in my program through a member variable of this TOpenCLObject class, which I call mOCLObject.

Now my initialization code looks like this

int numofplat = TOpenCLObject::GetNumberOfPlatforms(); //Static function of the TOpenCLObject class that returns the number of valid OpenCL platforms.
                                                       //On my computer I have two platforms: the first is NVIDIA (with 1 GPU) and the second AMD (with one quad-core CPU).

bool temp = mOCLObject.Initialize(numofplat, 0); //Initialize all devices from platform 0, which in my case is the NVIDIA platform containing the GPU (GeForce GT 320).

temp = mOCLObject.CreateContext(true); //Create the OpenCL context. The true argument indicates that I want OpenCL-OpenGL interop.

temp = mOCLObject.CreateComQueuePerDevice(CL_QUEUE_PROFILING_ENABLE); //One command queue is created for every device on the chosen platform. The argument enables timing profiling.

TKernel TheCode; //Just constructing an object of the TKernel class is enough to have all the kernel code available.

temp = mOCLObject.ConstructProgram(&TheCode); //This function creates a cl_program object inside my TOpenCLObject class.

temp = mOCLObject.CompileOpenCLProgram(0); //The previously created program is compiled; argument 0 specifies which device I use.

From now on mOCLObject contains everything I need and because it’s a member variable of my application class, it can be used everywhere in my program.

I’ve left out the error handling code.

I hope this is what you asked for?
TKernels.h (979 Bytes)
TKernels.cpp (1.78 KB)
TOpenCLObject.cpp (14.7 KB)
TOpenCLObject.h (5.02 KB)

Unfortunately, this is still not the complete code, so I'll make a few assumptions.

  • I take it that you create your images with the appropriate CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY flags, which allow the runtime to make a few optimizations.

  • Although it would surprise me, judging by what you said it seems you create the kernel in every frame. I can't stress enough how unnecessary that is.

  • A rather common beginner mistake is setting the kernel arguments on every kernel call. Yet again unnecessary; see the sketch after this list for what I mean by these two points.

  • Doing the profiling calculations while the GPU is not working is a waste. Compute your kernel timings between issuing the GL draw commands and glFinish().

  • If you use immediate mode drawing (which is slow by itself), I believe there is no need for glFinish(), which is yet again an extremely costly function, even if there is actually nothing to synchronize.
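
To illustrate the second and third points, I mean something along these lines (a rough sketch with placeholder names; all of this belongs in your initialization, not in the per-frame loop):

//--- done ONCE at initialization ---
cl_int err;
cl_kernel kernel = clCreateKernel(program, "InvertGrayImage", &err);
//the interop images are also created once, with the proper access flags
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &Images[0]);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &Images[1]);
//--- per frame you then only acquire, launch and release ---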

I highly advise you never to use immediate mode rendering again. It was highly suboptimal even years ago, not to mention now. I suggest getting to know OpenGL 3.x+, as that is very close to Cg, not to mention faster and more flexible.

As for the pitfalls in your application, your main loop should look something like this, and contain ONLY these commands (let's suppose I write down how it should look both in immediate mode and in GL 3.2):

while(!quit)
{
    cl_event kernelStat;

    if(immediate) glFinish();

    clEnqueueAcquireGLObjects(images);

    clEnqueueNDRangeKernel(myKernel, ..., &kernelStat);

    clEnqueueReleaseGLObjects(images);

    clFinish();

    if(immediate) { glBegin(); ... glEnd(); }
    else glDrawArrays();

    calculateProfiling(kernelStat);
}
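
For the glDrawArrays() branch, the quad only needs to be set up once at initialization. A minimal sketch (this uses the legacy vertex-array path with a VBO to keep it short; in real GL 3.x+ core code you would use a VAO, GL_TRIANGLE_STRIP and your own shaders):

//--- once at init: upload the fullscreen quad (x, y, s, t per vertex) into a VBO ---
GLfloat quad[] = {
    0.0f, 0.0f, 0.0f, 0.0f,
    1.0f, 0.0f, 1.0f, 0.0f,
    1.0f, 1.0f, 1.0f, 1.0f,
    0.0f, 1.0f, 0.0f, 1.0f
};
GLuint quadVBO;
glGenBuffers(1, &quadVBO);
glBindBuffer(GL_ARRAY_BUFFER, quadVBO);
glBufferData(GL_ARRAY_BUFFER, sizeof(quad), quad, GL_STATIC_DRAW);

//--- per frame: a single draw call instead of glBegin/glEnd ---
glBindBuffer(GL_ARRAY_BUFFER, quadVBO);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(2, GL_FLOAT, 4 * sizeof(GLfloat), (void*)0);
glTexCoordPointer(2, GL_FLOAT, 4 * sizeof(GLfloat), (void*)(2 * sizeof(GLfloat)));
glDrawArrays(GL_TRIANGLE_FAN, 0, 4);
glDisableClientState(GL_TEXTURE_COORD_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);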

Anything you do on the CPU, try to do it while the GPU is working. Always try to avoid one device waiting for the other. If anything is unclear, please tell me.

Thanks for all the tips. Seems like you caught me, I am indeed a beginner.
But anyway, I did use the CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY flags (so at least that part I got right).
I did create a cl_kernel at every kernel call, so for every frame, and I did set the arguments at every call (which I also do in the Cg shader version of my program).
I also use immediate mode drawing, but so does the Cg shader version of the program, so there should not be any difference between my OpenCL and Cg projects on that account. I only draw 4 vertices as a GL_QUADS primitive with 4 texture coordinates, so I don't think that is the bottleneck; but as you mentioned, there are much better ways to do this and I will make use of them in the future.
But now there's a big but!
I adjusted everything to match the main loop you suggested and I see absolutely no difference in performance, even without glFinish. It still runs at around 330 FPS. It seems that creating kernels, setting kernel arguments, acquiring exclusive access for OpenCL and releasing it again all take negligible time compared to the actual kernel.
I don't exactly understand your suggestion concerning the timings while the GPU is running. I thought that if you pass an event as argument to the clEnqueueNDRangeKernel command, it holds all the timing info once the kernel has finished. So when the kernel is done, the event is completely defined and it doesn't matter when you actually query the start and end info from it. But that's just what I make of the explanations in the OpenCL spec, and I'm probably wrong here.
I tried another way to do some rough timings, which you could probably call "timing by a dummy", but I'm taking my chances here. I loaded one image and then executed the main loop once, but in this main loop I put a for loop of 10000 iterations around the clEnqueueNDRangeKernel(…) call. Not so surprisingly, the results popped up on the screen after roughly 30 seconds, which comes down to around 330 FPS (10000/30). I performed several timings like this, as shown in the attached files. My conclusion is that creating kernels, setting arguments and even glFinish and clFinish are negligible compared to the actual kernel call and the clEnqueueAcquireGLObjects/ReleaseGLObjects functions, at least for this example.
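
In code, that crude measurement looks roughly like this (a sketch of what I did, written with a C++11 host timer for clarity; the GL objects are assumed to be already acquired):

#include <chrono>

const int iterations = 10000;
clFinish(mOCLObject.mComQueues[0]); //make sure nothing is still pending before starting the clock
std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
for (int i = 0; i < iterations; ++i)
{
    clEnqueueNDRangeKernel(mOCLObject.mComQueues[0], kernel, 2, offset, global_size, local_size, 0, NULL, NULL);
}
clFinish(mOCLObject.mComQueues[0]); //wait until all launches have actually finished
std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();
double fps = iterations / seconds; //roughly 330 in my case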

In the meantime I found an interesting and perhaps analogous topic at this link: http://devgurus.amd.com/thread/159112
But I don't know what they mean there by a "tiled format"?
Timings.rtf (43 KB)