Pass OpenGL data to CUDA. Question about speed.

My application takes the rendered results from OpenGL (both the depth map and the rendered 2D image information)
into CUDA for processing.

One way I did it is to retrieve the image/depth map with glReadPixels(…, image_array_HOST/depth_array_Host)*, and then pass image_array_HOST/depth_array_Host to CUDA
with cudaMemcpy(…, cudaMemcpyHostToDevice). I have done this part, although it sounds redundant (GPU > CPU > GPU).
*image_array_HOST/depth_array_Host are arrays I define on the host.
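For reference, here is a minimal sketch of what I mean by this first method (assuming a float depth map of width*height pixels; the names are just placeholders from my description above):

// Method 1 sketch: read back to host memory, then copy to the device.
float *depth_array_Host = (float*)malloc(sizeof(float) * width * height);
float *depth_map_Device;
cudaMalloc((void**)&depth_map_Device, sizeof(float) * width * height);

glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, depth_array_Host);   // GPU > CPU
cudaMemcpy(depth_map_Device, depth_array_Host, sizeof(float) * width * height, cudaMemcpyHostToDevice);   // CPU > GPU
my_kernel <<<block_number, thread_number>>> (..., depth_map_Device, ...);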

Another way is to use the OpenGL<>CUDA interop.
The first step is to create a buffer in OpenGL, and then read the image/depth information into that pixel buffer.
A CUDA graphics resource (I call it a "cuda token" below) is also registered and linked to that buffer, and then the device array used by CUDA is linked to that token.
(As far as I know, there seems to be no direct way to link a pixel buffer to a CUDA device array; there has to be such a token for OpenGL and CUDA to recognize each other. Please correct me if I am wrong.)

I have also done this part. I thought it should be fairly efficient, because the data CUDA is processing is
not transferred anywhere; it stays where OpenGL put it. It is data processing inside the device (GPU).

However, the time I measured for the second method is even (slightly) longer than the first one (GPU > CPU > GPU).
That really confuses me.

I am not sure if I missed any part, or maybe I didn’t do it in an efficient way.

One thing I am also not sure about is glReadPixels(…, *data).
In my understanding, if *data is a pointer to memory on the HOST, then it will transfer the data from GPU to CPU.
If *data = 0 and a pixel pack buffer is bound, then the data will be transferred into that buffer, and it should be a GPU > GPU copy.
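To illustrate the two cases as I understand them:

// Case 1: *data points to host memory, so the readback goes GPU > CPU.
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, image_array_HOST);

// Case 2: a PBO is bound to GL_PIXEL_PACK_BUFFER and the last argument is an offset (0)
// into that buffer, so the readback should stay on the GPU.
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);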

Maybe some other method can pass the data more efficiently than glReadPixels(…, 0).
My code follows below; sorry, I don't know how to post code properly, so it is shown as normal text.

Hope someone can explain this to me. Many thanks.

// declare one pointer and memory location on CUDA for later use
// (note: cudaGraphicsResourceGetMappedPointer below overwrites this pointer,
// so this allocation is not actually used in the interop path)
float *depth_map_Device;
cudaMalloc((void**) &depth_map_Device, sizeof(float) * size);


// initiate the CUDA<>OpenGL interop
cudaGLSetGLDevice(0);


// generate a buffer, and link the cuda token to it -- buffer <> cuda token
GLuint gl_pbo;
cudaGraphicsResource_t cudaToken;
size_t data_size = sizeof(float) * number_data;		// number_data is defined beforehand
void *data = malloc(data_size);
glGenBuffers(1, &gl_pbo);
glBindBuffer(GL_ARRAY_BUFFER, gl_pbo);
glBufferData(GL_ARRAY_BUFFER, data_size, data, GL_DYNAMIC_DRAW);	// allocate data_size bytes
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaToken, gl_pbo, cudaGraphicsMapFlagsNone);
free(data);
    // now there is a link between gl_pbo and cudaToken


// now map (link) the data in the buffer to CUDA
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);
    // read the rendered data into the bound buffer; since the last argument is 0
    // (an offset into the PBO), it should still be fast? (GPU > GPU)
    // width & height are defined beforehand
    // the format can be GL_DEPTH_COMPONENT or others as well, GL_RED is just an example here
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
cudaGraphicsMapResources(1, &cudaToken, 0);
    // map cudaToken (which is linked to gl_pbo) into the current CUDA context
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_Device, &data_size, cudaToken);
    // no copy happens here; this just gives CUDA a device pointer into the buffer's memory

// CUDA kernel (run while the resource is still mapped)
my_kernel <<<block_number, thread_number>>> (..., depth_map_Device, ...);

cudaGraphicsUnmapResources(1, &cudaToken, 0);		// unmap it, for the next round

Hi,

Since there is no reply yet, I am bumping this post.

Many thanks

I think I can answer my question partly now, and hope it is useful for some people.

I was binding the PBO to float CUDA (GPU) memory, but it seems the raw image data OpenGL renders is in unsigned char format (the following is my supposition), so this data needs to be converted to float before being passed to CUDA memory. I think OpenGL does this format conversion on the CPU, and that is why there is no big difference between using the PBO and not using it.

By using unsigned char (glReadPixels(…, GL_UNSIGNED_BYTE, 0)), reading RGB data with the PBO bound is quicker than without the PBO. I then pass the data to a simple CUDA kernel to do the format conversion, which is more efficient than what OpenGL does. Doing it this way is much quicker.
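The conversion kernel I mean is roughly this (a sketch, assuming the PBO holds tightly packed GL_RGB / GL_UNSIGNED_BYTE data; the pointer names and the normalization to [0,1] are just my choices):

// simple format-conversion kernel: unsigned char -> float
__global__ void uchar_to_float(const unsigned char *src, float *dst, int num_values)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_values)
        dst[i] = src[i] / 255.0f;    // 0..255 -> 0..1
}

// launch example: num_values = width * height * 3 for GL_RGB
// int threads = 256;
// int blocks = (num_values + threads - 1) / threads;
// uchar_to_float<<<blocks, threads>>>(mapped_pbo_Device, rgb_float_Device, num_values);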

However, it doesn't work for the depth buffer. For some reason, reading the depth map with glReadPixels (with or without a PBO) is slow. I then found two old discussions: Reading the Depth Buffer - Why so slow? - OpenGL: Advanced Coding - Khronos Forums

They pointed out the format question, and that is exactly what I found for RGB (unsigned char). But I have tried unsigned char, unsigned short, unsigned int, and float for reading the depth buffer, and they all perform at almost the same speed.
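Concretely, the combinations I tried look roughly like this (with the PBO bound to GL_PIXEL_PACK_BUFFER as in my code above, and sized for the largest type):

glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, 0);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_UNSIGNED_SHORT, 0);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_UNSIGNED_BYTE, 0);
// all of these run at about the same speed for me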

So I still have a speed problem for reading depth.

Does anyone have a clue about how to increase the speed of reading the depth map (with a PBO, probably)?

I think, ideally, it should be possible to make it as quick as reading RGB (glReadPixels with a PBO).

But currently, reading the depth map costs me about 10 times more than reading the red channel (GL_RED).

A late reply, but maybe useful for others:

Since accessing the OpenGL depth buffer from CUDA is still a problem, a workaround is to write your own fragment shader that stores the depth data in an RGB render buffer instead of the usual depth render buffer.
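A rough sketch of that idea (one possible variant: instead of an RGB buffer, a single-channel GL_R32F color attachment, which can then be read back with GL_RED/GL_FLOAT like the color data above; the GLSL source is kept as a C string and all names are placeholders):

// hypothetical fragment shader that writes depth into a color attachment
const char *depth_to_color_fs =
    "#version 330 core\n"
    "out float frag_depth;\n"              // rendered into the GL_R32F color attachment
    "void main() {\n"
    "    frag_depth = gl_FragCoord.z;\n"   // window-space depth in [0,1]
    "}\n";

// FBO setup sketch: attach a GL_R32F texture as the color target
// GLuint depth_tex;
// glGenTextures(1, &depth_tex);
// glBindTexture(GL_TEXTURE_2D, depth_tex);
// glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, width, height, 0, GL_RED, GL_FLOAT, NULL);
// glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, depth_tex, 0);
// after rendering, read it back with glReadPixels(..., GL_RED, GL_FLOAT, 0) into a PBO as above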