CUDA and OpenGL - Beginner question

Hi,

I have project where I need to work with Qt, OpenGL and CUDA.
Basically, I need to create a CUDA library that will take a pointer to an image, process the image, give back a way to get the result. The library will be used to process the image, not for loading it or displaying it.

I am a beginner in CUDA and OpenGL so I’m kind of lost.
I was using this tutorial to create my CUDA library: OpenGL Interoperability with CUDA | 3D Game Engine Programming

Here some of my code:

__global__ void sobelfilter_kernel(int iw, int ih, unsigned char *source, unsigned char *dest)
{
    // Calculate our pixel's location
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    // Operate only if we are in the correct boundaries
    if(x > 0 && x < iw - 1 && y > 0 && y < ih - 1)
    {
        int gx = -source[iw*(y-1)+(x-1)] + source[iw*(y-1)+(x+1)] +
                 -2*source[iw*(y)+(x-1)] + 2*source[iw*(y)+(x+1)] +
                 -source[iw*(y+1)+(x-1)] + source[iw*(y+1)+(x+1)];
        int gy = -source[iw*(y-1)+(x-1)] - 2*source[iw*(y-1)+(x)]
                 -source[iw*(y-1)+(x+1)] +
                  source[iw*(y+1)+(x-1)] + 2*source[iw*(y+1)+(x)] +
                  source[iw*(y+1)+(x+1)];
        dest[iw*y+x] = (int) sqrt((float)(gx)*(float)(gx) + (float)(gy)*(float)(gy));
    }
}
int processingExec(int opengl_buffer, int width, int height)
{
	//Register the OpenGL Vertex Buffer with CUDA
	struct cudaGraphicsResource * cuda_vbo_resource = 0;
	cudaGraphicsGLRegisterBuffer(&cuda_vbo_resource, opengl_buffer, cudaGraphicsRegisterFlagsNone);

	//Give access authority of the vbo to cuda
	//This will effectively “lock” the resource to the CUDA resource object
	cudaGraphicsMapResources(1, &cuda_vbo_resource, 0);

	 //Get a pointer to the device memory that can be used in the CUDA kernel
	 size_t num_bytes = 0;
	 unsigned char * dev_pointer = 0;
	 cudaGraphicsResourceGetMappedPointer((void**)dev_pointer, &num_bytes, cuda_vbo_resource);

//Kernel
         sobelfilter(width, height, dev_pointer, dev_pointer);

	 //Unmap the result from CUDA in order to be able to use it in opengl
	 cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);
}

The opengl_buffer is the ID of my QOpenGLTexture created in Qt.
I’m using the code of a sobel filter as the kernel process.

First, I am not sure if I can use the same pointer for the source and the result for the sobelfilter.
Second, I have no idea what am I supposed to do to get back the result from the process. How am I supposed to get the result from the library to use it in OpenGL (on the Qt application, not inside the library)?
Third, the Unmap function is creating the “freeze” of the application. There is nothing displayed and the application is transparent. If I comment the Unmap function, I get the image but there is no sobel filter applied on it so basically, I think I am not getting back the result.

Thank you for your help!

Morgane

“First, I am not sure if I can use the same pointer for the source and the result for the sobelfilter.”

it depends on the mapping between source and result, from the perspective of the device/ threads
in some cases, the mapping is one to one, implying that a write can not overwrite a subsequent read, such that the source may very well become the result/ destination
in other cases, a bit of shuffling occurs, such that a write may very well overwrite a subsequent read

“Second, I have no idea what am I supposed to do to get back the result from the process. How am I supposed to get the result from the library to use it in OpenGL (on the Qt application, not inside the library)?”

the reverse of how you get the source/ data to/ on the device
memory copies between the device and host, and vice versa

“Third, the Unmap function is creating the “freeze” of the application. There is nothing displayed and the application is transparent. If I comment the Unmap function, I get the image but there is no sobel filter applied on it so basically, I think I am not getting back the result.”

you need to determine when the host and device must synchronize, and issue the necessary calls to this end
you specify no streams; hence you are using the default stream
all work in streams are sequential, meaning you can ‘forward schedule/ issue’; but you still need to know when the work is done - synchronize

Hi,

Thank you for your answer.
I tried to solve my problem based on your answer but I am not able to get it work.
One of my question is: Does cudaGraphicsGLRegisterBuffer copy the memory from the host to device ? Or do I have to add a cudaMemcpy ?
I added the cudaMemcpy from the host to device but it is still freezing. If I remove it, the result I am getting is the original image. The sobel filter is not applied on it. What is the problem ? Do you have some ideas of what I need to search for ?

int processingExec(int opengl_buffer, int width, int height)
{
	struct cudaGraphicsResource * cuda_vbo_resource = 0;
	cudaGraphicsGLRegisterBuffer(&cuda_vbo_resource, opengl_buffer, cudaGraphicsRegisterFlagsNone);

	cudaGraphicsMapResources(1, &cuda_vbo_resource, 0);

	size_t num_bytes = 0;
	unsigned char * dev_pointer = 0;
	cudaGraphicsResourceGetMappedPointer((void**)dev_pointer, &num_bytes, cuda_vbo_resource);

	unsigned char * result_pointer = 0;
	cudaMalloc((void **) &result_pointer, (num_bytes));

	unsigned char * respointer = 0;
	cudaMalloc((void **) &respointer, (num_bytes));

	sobelfilter(width, height, dev_pointer, result_pointer);

	cudaMemcpy((void**)respointer, (void**)result_pointer, num_bytes, cudaMemcpyDeviceToHost);

	cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);
	return 0;
}

I still do not understand where is the link between OpenGL and Cuda here. My “respointer” is never linked to openGL. I’m passing the GLuint of my OpenGL texture but not a pointer in the memory so I do not understand how it should know where to store the data. What should I pass into the cudaMemcpy ? (because I suppose that what I am doing is wrong).

Thank you,

Morgane

most api’s are asynchronous, most of the time
this is simultaneously magnificent and potentially dangerous
the host does not hold the device’s hand when it is done with its part - delegating work
this is great, as the host can now issue lists of work, instead of just work
this is terrible, when you do not synchronize
it is like the traffic officer that either improves or further worsens traffic flow

sobelfilter(width, height, dev_pointer, result_pointer);

cudaMemcpy((void**)respointer, (void**)result_pointer, num_bytes, cudaMemcpyDeviceToHost);

cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);
return 0;

the host is going to issue the kernel for the device to execute
the host is then going to issue a memory copy to get the results back
the host is going to start to clean up
the host is going to return, with the results supposedly in the array

hence, the host does not bother to check whether the device is finished with the kernel, and that the results are therefore in the result array
also, the host does not bother to check whether the device is finished, before it starts to clean up (arrays the device might be still referencing)

you need to synchronize - cudaDeviceSynchronize() as one method

also, if the kernel places the result in result_pointer on the device, and if you subsequently copy that result to the host - result_pointer >> res_pointer; how do you pass on the result to whatever is waiting for it - the program calling int processingExec()?

you can use the debugger and one or two lines of code to actually test whether you have proper data going into your kernel - dev_pointer/ source

do not forget to subsequently clean up after you malloc’s

Thank you for your answer. I’m actually working on it to add the synchronize function, etc.

It is exactly the question I am asking here. I have no idea how to that more particularly because it is a library that will be used in Qt/OpenGL.

“I’m actually working on it to add the synchronize function”

sobelfilter(width, height, dev_pointer, result_pointer);

cudaMemcpy((void**)respointer, (void**)result_pointer, num_bytes, cudaMemcpyDeviceToHost);

cudaDeviceSynchronize();

cudaFree(result_pointer);

cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);

return 0;

consider the host and device synchronized

“It is exactly the question I am asking here. I have no idea how to that more particularly because it is a library that will be used in Qt/OpenGL.”

i presume the qt application is aware of where the original data source is - the openGL part

the qt - or any other - application would then likely call processingExec() as library like this:

processingExec(openGL* get_the_source_data_here, type* put_the_result_here, type* additional_overhead_data_as_required)

hence, the application should pass to the library a pointer to the source, and a pointer to where it expects the result; probably best if the application pre-allocates the latter

you probably need to include the appropriate headers in the qt application, such that it can follow and comprehend the underlying cuda types

Thanks. (I said I was adding the synch because I have some errors with it (unspecified launch failure))

I understand but how am I supposed to link back the pointer of the result in OpenGL? In my example I can return “respointer” right ? But which OpenGL function will accept this pointer?
All the example I am seeing are using a texture and they display the result using OpenGL in CUDA (http://3dgep.com/opengl-interoperability-with-cuda/).
I supposed it is more an OpenGL question than a CUDA one right now.

“I understand but how am I supposed to link back the pointer of the result in OpenGL? In my example I can return “respointer” right ? But which OpenGL function will accept this pointer?”

i am starting to feel confused

i think you need to confirm what is where, when - when data is on the device (and where on the device), and when data is on the host, and when you expect data to be on the device/ host
the way i read the lines, you get data from the device via opengl, apply filters to it on the device via cuda, and copy the result back to the host
why would you then wish to link back a pointer to a result you perceive would still be opengl based, when you process it and push it into cuda/ ordinary/ non-opengl device memory

perhaps i do not have my thinking cap/ reading glasses on…

Sorry about the confusion. I have to say that I am confused as I am new in OpenGL and CUDA…
I’m not sure about the first part about the “get data from the device via OpenGL”.
My Qt/OpenGL code is based on the Cube example. You can see some of the code here: http://qt-project.org/doc/qt-4.8/opengl-cube.html

But basically, I’m using the vertex and fragments using .glsl code. I though it was from the host but I am not sure…

What I want to do is to:

  • Load an image in my application via OpenGL
  • Pass it to my CUDA library and process it (apply a sobel filter)
  • Get back the result from my CUDA library
  • Display it in my application via OpenGL

So basically, CUDA is just processing the image. My application should load and display it using OpenGL and my application should know nothing about CUDA code.

Let me know if you need more information, what are your thoughts about that, etc.
As I already told you, I’m new in CUDA and OpenGL so it is possible that I’m misunderstanding things.

a number of points:

a) what you want to do seems rather do-able
b) your cuda interop reference seems quite comprehensive - it delineates a number of core opengl principles, as well as their implementations
c) you may need some stitching to fuse your qt application sample and the cuda interop sample into the end-result you desire - the main difference between the 2 applications seems to be that the one draws using a vertex buffer and the other a pixel buffer, but the cuda interop sample clearly discusses the former as well

if it is true that you are an opengl and cuda beginner, i would, if i were you, seriously consider tackling the problem in very incremental steps, to make the implementation as easy as possible

you could very well:
a) start by simply taking the cuda interop source code as is
b) remove the kernel from mentioned source, and insert your own
c) extract (only) the opengl cube code from the qt application sample, and attempt to draw it within the cuda interop source
d) strip the completed work, and move the necessary parts back to qt

this way you would have a project/ source that compiles and that can be tested/ debugged from the word go
and you would be making incremental changes; so when things go south, you would be able to very quickly pinpoint the possible cause

the cuda programming guide discusses textures and their use within cuda; and the opengl primitives, etc should be documented, such that you can cross-reference opengl functionality within the samples, to follow what they do and do not do

Hi,

Sorry for the late answer.
Thank you a lot for your help. I was not sure if what I wanted to do was do-able so thank you for that.
I followed your advice about the implementation, step by step.
Right now, I think I have a problem with the OpenGL buffer probably.

I’m using a Vertex Buffer (QOpenGLBuffer) in Qt and I store its ID

// Create the Vertex buffer
m_vertex.create();
// Bind the buffer so that it is the current active buffer
m_vertex.bind();
// Allocate the vertices size
m_vertex.allocate(vertices, sizeof(vertices) * sizeof(VertexData));
m_opengl_vbo = m_vertex.bufferId();

And I am passing the ID to my library:

checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, opengl_vbo, cudaGraphicsMapFlagsWriteDiscard));
unsigned int *d_result;
checkCudaErrors(cudaMalloc((void **)&d_result, width*height*sizeof(unsigned int)));
unsigned int *h_result = (unsigned int *)malloc(width * height * sizeof(unsigned int));
checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));
size_t num_bytes_result;
checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_result, &num_bytes_result,  cuda_pbo_resource));
boxFilterRGBA(d_img, d_temp, d_result, width, height, filter_radius, iterations, nthreads);
checkCudaErrors(cudaDeviceSynchronize());
checkCudaErrors(cudaMemcpy((unsigned char *)h_result, (unsigned char *)d_result, width*height*sizeof(unsigned int), cudaMemcpyDeviceToHost));

I am getting an error:

code=11(cudaErrorInvalidValue) "cudaMemcpy((unsigned char *)h_result, (unsigned char *)d_result, width*height*sizeof(unsigned int), cudaMemcpyDeviceToHost)"

And I am getting a segmentation fault when I am trying to display the value of the d_result.

What can be the problem ? Does the result of the “num_bytes_result” should be the same than the number of bytes of the image ? (here the result is 1600 whereas the image is 4194304).
Can it be a problem between using OpenGL Vertex buffer while using a Pixel buffer ? If yes, what do I have to do to get it work ? How to make the change between Vertex array and Pixel array ?

Thank you a lot for your help!

Morgane

i have to look more closely to what you are doing above, but what strikes my attention immediately, perhaps related to the error code you are getting, is that you declare h_result and d_result as unsigned int*, but cast both as unsigned char* in the subsequent memory copy
i am not sure whether the memory copy would appreciate this very much

Hi,
Thank you for your answer.
I deleted the cast but I am still getting the error.
Do you think it is related to the size of the d_result ?
Thanks again for your help, I really appreciate.

“Do you think it is related to the size of the d_result ?”

you can easily test this by decreasing the memory copy’s number of bytes to copy, and noting whether the error disappears or prevails
you could decrease it to a mere unsigned int, for test purposes

but,
the api document notes the function/ description of cudaGraphicsResourceGetMappedPointer as:
“Get an device pointer through which to access a mapped graphics resource”
“Returned pointer through which resource may be accessed”

hence, my thinking is that that particular call would overwrite the preceding
checkCudaErrors(cudaMalloc((void **)&d_result,

in plain words, you can not store 2 different pointers in the same pointer variable, and preserve both
hence, the memory copy may be complaining about the specified device memory pointer, rather than the size of the copy
just check

Hi,

I did some tests by removing the “checkCudaErrors(cudaMalloc((void **)&d_result” but I’m still getting the same error.

When I am modifying the size of the h_result to allocate the same size than the d_result like the following

unsigned int *h_result = (unsigned int *)malloc(num_bytes_result);
printf("cudaMemcpy starts\n");
checkCudaErrors(cudaMemcpy(&h_result, d_result, num_bytes_result, cudaMemcpyDeviceToHost));
printf("cudaMemcpy ends\n");

instead of

unsigned int *h_result = (unsigned int *)malloc(width * height * sizeof(unsigned int));
checkCudaErrors(cudaMemcpy(h_result, d_result, width*height*sizeof(unsigned int), cudaMemcpyDeviceToHost));

it’s freezing right after the CudaMempcy (I got “cudaMemcpy starts” printed only).

Do you have any ideas what is wrong with my code ? :(

Morgane

i do not think it wise to pass to cudaMemcpy, &h_result

you can simply test if the size upsets cudaMemcpy, by leaving the initial allocations as is, and altering the size pertaining to cudaMemcpy:

cudaMemcpy(h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost)

i hardly think this is wrong:

unsigned int h_result = (unsigned int )malloc(width * height * sizeof(unsigned int));
checkCudaErrors(cudaMemcpy(h_result, d_result, width
height
sizeof(unsigned int),
cudaMemcpyDeviceToHost));

however,
the device (memory) pointer you pass to cudaMemcpy, is that returned by cudaGraphicsResourceGetMappedPointer
because you pass to cudaGraphicsResourceGetMappedPointer, d_result, just before the memory copy
the type of pointer returned by cudaGraphicsResourceGetMappedPointer may be incompatible with cudaMemcpy
cudaMemcpy expects a ‘hard’ device pointer, and i doubt whether the type of pointer returned by cudaGraphicsResourceGetMappedPointer can be classified as such; it may be fine for use by a kernel though
hence, i suggest using 2 pointers - splitting d_result in 2 pointers
one for the cudaMemcpy, and 1 for cudaGraphicsResourceGetMappedPointer
you pass the latter to your kernel to read from, and the former to your kernel to write to
and you pass the former to cudaMemcpy
the kernel should probably read via the pointer associated with (passed to) cudaGraphicsResourceGetMappedPointer, and should likely write to the pointer associated with (passed to) cudaMemcpy