Reading and Writing OpenGL Textures with CUDA

Hello,

For almost two weeks I have been trying to use the CUDA/OpenGL interop interface. I want to read from a GL_TEXTURE_2D (the old color), do a [newColor = vec4(1.f) - oldColor] operation in CUDA, and write the result to a GL_TEXTURE_2D (the new color), just as a simple example that can be extended to further operations.

But all I have achieved so far is a lot of frustration and sometimes access violation errors, but never a working example. Whenever I render the texture that CUDA is supposed to have written, all I see is the old GPU memory content.

I first tried to write a fixed color (RGBA = 0, 0.6, 0.8, 1) to a texture; later I wanted to read from a texture and store that value instead of the fixed color.

What I did is this:

  1. I created a texture during the init phase (writeTex is a GLuint):
    glGenTextures(1, &writeTex);
    glBindTexture(GL_TEXTURE_2D, writeTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_FLOAT, nullptr);
    
  2. After creating the texture I made the following CUDA calls (still during the init phase):
    auto e = cudaGLSetGLDevice(0);
    e = cudaGraphicsGLRegisterImage(&writeRes, writeTex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsSurfaceLoadStore);
    
  2. Then in every render pass I did this:
    auto e = cudaGraphicsMapResources(1, &writeRes);
    cudaArray_t writeArray;
    e = cudaGraphicsSubResourceGetMappedArray(&writeArray, writeRes, 0, 0);
    cudaResourceDesc wdsc;
    memset(&wdsc, 0, sizeof(wdsc));   // zero the descriptor so unused fields are defined
    wdsc.resType = cudaResourceTypeArray;
    wdsc.res.array.array = writeArray;
    cudaSurfaceObject_t writeSurface;
    e = cudaCreateSurfaceObject(&writeSurface, &wdsc);
    fillBlue(writeSurface, dim3(width, height));
    e = cudaDestroySurfaceObject(writeSurface);
    e = cudaGraphicsUnmapResources(1, &writeRes);
    e = cudaStreamSynchronize(0);
    

    Then I did a simple OpenGL texture rendering:

    ... shader setup matrices ...
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, writeTex);
    glBindSampler(0, texturesampler);
    glUniform1i(locTextures_image, 0);
    ... draw call ...
    
  3. The CUDA kernel launch function fillBlue(...) looks like this:
    void fillBlue(cudaSurfaceObject_t surface, dim3 texDim)
    {
       dim3 thread(32, 32);
   dim3 block((texDim.x + thread.x - 1) / thread.x, (texDim.y + thread.y - 1) / thread.y);   // round up so the whole texture is covered
       blue <<< block , thread >>>(surface, texDim);
    }
    

    and the CUDA kernel:

    __global__ void blue(cudaSurfaceObject_t s, dim3 texDim)
    {
       unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
       unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    
       if(x >= texDim.x || y >= texDim.y)
       {
          return;
       }
     
       float4 data = make_float4(.0f, .6f, .8f, 1.0f);
       surf2Dwrite(data, s, x * sizeof(float4), y);
    }
    

I don't get any error values when checking the error variable (e).

So my question is: am I doing something fundamentally wrong? Why do I see the old memory content all the time instead of the (0, .6, .8, 1) color when I render the texture?

Thanks for your help so far!

Markus

Is it possible to write an OpenGL texture through a CUDA surface?

MK

According to "CUDA/OpenGL interop, draw to OpenGL texture with CUDA" on Stack Overflow, it seems so.

But I wouldn't be surprised if it is not possible.

I didn't find a working example on this topic so far, but if it isn't possible at all, the whole CUDA approach is a dead end. My plan is to render solid geometry to a GL_TEXTURE_3D, but if it already fails with GL_TEXTURE_2D…

What compute capability does your card provide? You are using the Surface Object API, which needs a card with CC 3.5, I guess…
If that is the problem, you'd have to use the Surface Reference API. Otherwise I have no clue.

I have a GeForce GTX 780. I think it should provide CC 3.5. But just in case: how can I query the compute capability? cudaGetDeviceProperties?
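
cudaGetDeviceProperties does work for that; a minimal standalone sketch (device index 0 assumed, as in the cudaGLSetGLDevice call above) could look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t e = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    if (e != cudaSuccess)
    {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(e));
        return 1;
    }
    // major.minor together form the compute capability, e.g. 3.5
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}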

That is the result of the properties query:
[screenshot of the device properties output]

According to CUDA GPUs - Compute Capability | NVIDIA Developer, the 780 should definitely be capable of CC 3.5.
So this shouldn't be the problem.

Thank you for your help so far! Then it seems to be a conceptual misunderstanding on my side, right?

My current state is that the texture still contains the old garbage, but in the second rendering iteration I get an access violation when I create a new cudaSurfaceObject_t.

That is the current code:

void CudaTextureScene::cudaPass()
{
   gpu::error::check(cudaGraphicsGLRegisterImage(&writeresource, cudatexture, gl::TEXTURE_2D, cudaGraphicsRegisterFlagsSurfaceLoadStore));

   gpu::error::check(cudaGraphicsMapResources(1, &writeresource, 0));

   cudaArray_t writeArray;
   gpu::error::check(cudaGraphicsSubResourceGetMappedArray(&writeArray, writeresource, 0, 0));

   gpu::functions::fillBlue(&writeArray, dim3(texture->getWidth(), texture->getHeight()));

   gpu::error::check(cudaGraphicsUnmapResources(1, &writeresource, 0));

   gpu::error::check(cudaGraphicsUnregisterResource(writeresource));
}


void fillBlue(cudaArray_t *writeTo, dim3 textureDim)
{
   struct cudaResourceDesc description;
   memset(&description, 0, sizeof(description));
   description.resType = cudaResourceTypeArray;
   description.res.array.array = *writeTo;

   cudaSurfaceObject_t write;
   gpu::error::check(cudaCreateSurfaceObject(&write, &description));

   dim3 threads(32, 32, 1);
   dim3 blocks((textureDim.x + threads.x - 1) / threads.x, (textureDim.y + threads.y - 1) / threads.y);   // round up to cover the full texture

   blue<<< blocks , threads >>>(write, textureDim);

   gpu::error::check(cudaDestroySurfaceObject(write));
}

After the first rendering iteration I get to the cudaGraphicsGLRegisterImage(…) call and it reports error 30 (unknown error). If I step through the code to cudaCreateSurfaceObject(…), I get an access violation at 0xccc…

I'm pretty clueless, but I suppose it won't get easier with OpenCL. And I'm starting to run out of time because it's a university project.

It seems the cause of the access violation is located in the CUDA kernel (I changed the if statement):

__global__ void blue(cudaSurfaceObject_t target, dim3 dimension)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	if(x < dimension.x && y < dimension.y)
	{

		float4 data = make_float4(0.f, .6f, .8f, 1.f);
		surf2Dwrite(data, target, x * sizeof(float4), y);
	}
}

If I comment out the if statement, I don't get the access violation any longer (and the texture shows the old GPU memory content).

The texture dimension is 1920 x 1080 and I call the kernel like this:

cudaSurfaceObject_t write;
gpu::error::check(cudaCreateSurfaceObject(&write, &description));

dim3 threads(30, 30);
dim3 blocks(tex->getWidth() / threads.x, tex->getHeight() / threads.y);
dim3 texDim(tex->getWidth(), tex->getHeight());

blue<<< blocks, threads >>>(write, texDim);

Any ideas what I'm doing wrong?

By commenting out the 'if statement', do you mean its whole body or only the if condition itself?

MK

I mean the whole if statement, but I can limit the problem to the surf2Dwrite call, which seems to produce the errors. With that line active, cuda-memcheck reports a varying number of errors (61 in the last run).

There is a single 'error 30' message for cudaGraphicsSubResourceGetMappedArray and a long list of 'invalid __global__ read of size 4' errors in the kernel. All of them try to access address 0x40100008, which is out of bounds.

The interesting part is that the access is independent of the blocks. The x component of the threads seems to follow a pattern (29, 28, 27, 26 and so on), but the y component of the threads is fixed at 29. Would a dump in a text file be helpful?

Thank you so far!

It turned out that it also crashes with

dim3 threads(1);
dim3 blocks(1);

so it seems the access to the cudaSurfaceObject_t itself causes the error…

This is the report of cuda-memcheck with the recently posted dim3 objects:

========= Invalid __global__ read of size 4
=========     at 0x000005e0 in g:\tools\nvidia cuda sdk\toolkit\include\/surface_indirect_functions.h:4899:blue(__int64, dim3)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x40100008 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\Windows\system32\nvcuda.dll (cuLaunchKernel + 0x166) [0xca26]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\cudart32_55.dll [0x2e2a]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\cudart32_55.dll (cudaLaunch + 0xd8) [0x14e98]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (_GLOBAL__N__41_tmpxft_000022f0_00000000_5_kernel_cpp1_ii_d947d236::cudaLaunch<char> + 0xc) [0x33fac]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (__device_stub__Z4bluey4dim3 + 0x46) [0x31146]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (blue + 0x14) [0x30484]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (cudaPass + 0x13d) [0x2f98d]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (render + 0x52) [0x2f4c2]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (main + 0xd6) [0x31076]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (__tmainCRTStartup + 0x199) [0x78f89]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (mainCRTStartup + 0xd) [0x7917d]
=========     Host Frame:C:\Windows\syswow64\kernel32.dll (BaseThreadInitThunk + 0x12) [0x1336a]
=========     Host Frame:C:\Windows\SysWOW64\ntdll.dll (RtlInitializeExceptionChain + 0x63) [0x39f72]
=========     Host Frame:C:\Windows\SysWOW64\ntdll.dll (RtlInitializeExceptionChain + 0x36) [0x39f45]
=========
========= Program hit error 30 on CUDA API call to cudaGraphicsSubResourceGetMappedArray
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\Windows\system32\nvcuda.dll (cuProfilerStop + 0x880f9) [0xa4a29]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\cudart32_55.dll (cudaGraphicsSubResourceGetMappedArray + 0x103) [0x1a003]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (cudaPass + 0x50) [0x2f8a0]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (render + 0x52) [0x2f4c2]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (main + 0xd6) [0x31076]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (__tmainCRTStartup + 0x199) [0x78f89]
=========     Host Frame:F:\Studium\Master_Semester_II\Hauptseminar\OpenGLCudaTest\Debug\OpenGLCudaTest.exe (mainCRTStartup + 0xd) [0x7917d]
=========     Host Frame:C:\Windows\syswow64\kernel32.dll (BaseThreadInitThunk + 0x12) [0x1336a]
=========     Host Frame:C:\Windows\SysWOW64\ntdll.dll (RtlInitializeExceptionChain + 0x63) [0x39f72]
=========     Host Frame:C:\Windows\SysWOW64\ntdll.dll (RtlInitializeExceptionChain + 0x36) [0x39f45]
=========
========= ERROR SUMMARY: 2 errors

I don't know whether it will help (it may be a bit off-topic), but when I want to write stuff to a texture I use the pixel buffer object 'trick': create the PBO and register it with CUDA, map it as kernel input/output, do the kernel work on it, and finally copy the outcome from the PBO to the bound GL texture.
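
A minimal sketch of that round-trip, assuming a GLuint pbo of width * height * sizeof(uchar4) bytes, a matching RGBA8 texture tex, and a hypothetical fillKernel that writes uchar4 texels (all names are assumptions, not from the posts above):

// once, at init time: register the PBO for write-only delivery to GL
cudaGraphicsResource_t pboRes;
cudaGraphicsGLRegisterBuffer(&pboRes, pbo, cudaGraphicsRegisterFlagsWriteDiscard);

// per frame: map the PBO, run the kernel on the raw device pointer, unmap
cudaGraphicsMapResources(1, &pboRes, 0);
uchar4* devPtr = nullptr;
size_t mappedSize = 0;
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &mappedSize, pboRes);
fillKernel<<<blocks, threads>>>(devPtr, width, height);   // hypothetical kernel
cudaGraphicsUnmapResources(1, &pboRes, 0);

// copy the PBO contents into the texture; with a bound unpack buffer the last
// argument of glTexSubImage2D is a byte offset into the PBO, not a host pointer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);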

MK

Hi,

Thanks for your help! I solved the problem.

You can write to an OpenGL texture directly, without a pixel buffer object, if you do it as described above. The texture was created with the unsized GL_RGBA internal format and GL_FLOAT source data, which is why I tried to write a float4 value into it. It turns out that such a texture stores 4 bytes per texel, so writing four floats (16 bytes) per texel is simply wrong. Although I assumed the texture to semantically hold four floats per texel, I had to write a uchar4 (4 bytes) to each pixel (see the code below). Unfortunately I didn't find any documentation about this, but I suppose I should have known, sigh.

__global__ void blue(cudaSurfaceObject_t target, dim3 dimension)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	if(x < dimension.x && y < dimension.y)
	{
		//float4 data = make_float4(0.f, .6f, .8f, 1.f);
		//surf2Dwrite(data, target, x * sizeof(float4), y);

		uchar4 data = make_uchar4(0x00, 0x99, 0xcc, 0xff);
		surf2Dwrite(data, target, x * sizeof(uchar4), y);
	}
}

But thank you for your help!


And do you reckon it is faster than the PBO approach? The fact that one needs to convert each channel to an unsigned byte isn't that good, though. What about precision? Will it be as good as with the PBO trick? Are there any better ways with the current CUDA version?

MK

Hi,

I didn't implement the PBO approach, so I don't know whether it is faster, but as far as I know it requires a glTexSubImage2D call after the CUDA kernel to copy the content of the PBO into the texture. And I would still have to write the data into the PBO…

The GL_FLOAT specified in the glTexImage2D call refers to the data type of the data pointer (in this case the nullptr), so it has nothing to do with the internal representation of the texture data. You don't have to convert the whole channel to ubyte, I guess… I had forgotten the meaning of this function parameter (GL has too many parameters, I suppose).

When I'm done with this semester I could do some research on the performance of this solution, but I'd expect better performance from the current approach. My current kernel time for a 1920x1080 texture is about 1.8 ms (read from a texture, do a color negation (1 - old), and write to a texture, so two texture accesses are required).
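
For reference, a negation kernel along those lines, assuming the uchar4/RGBA8 layout from the fix above and a second texture registered and mapped for reading (readSurface/writeSurface are hypothetical names), might look like this:

__global__ void negate(cudaSurfaceObject_t readSurface, cudaSurfaceObject_t writeSurface, dim3 dimension)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	if(x < dimension.x && y < dimension.y)
	{
		uchar4 oldColor;
		surf2Dread(&oldColor, readSurface, x * sizeof(uchar4), y);

		// 1 - old in 8-bit fixed point (255 corresponds to 1.0); alpha is kept here
		uchar4 newColor = make_uchar4(255 - oldColor.x, 255 - oldColor.y, 255 - oldColor.z, oldColor.w);
		surf2Dwrite(newColor, writeSurface, x * sizeof(uchar4), y);
	}
}

Both surfaces would come from textures registered with cudaGraphicsRegisterFlagsSurfaceLoadStore, as shown earlier in the thread.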

Hi,
Another solution to this problem is to use
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0, GL_RGBA, GL_FLOAT, nullptr);
instead of
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_FLOAT, nullptr);

and you will be able to use float4 instead of uchar4.
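
With GL_RGBA32F each texel really is four floats (16 bytes), so the commented-out write from the earlier kernel becomes valid as-is:

// valid with GL_RGBA32F, where each texel is 16 bytes
float4 data = make_float4(0.f, .6f, .8f, 1.f);
surf2Dwrite(data, target, x * sizeof(float4), y);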


Is it possible to write to the OpenGL default back buffer (GL_BACK) through a cudaSurfaceObject_t?
Thanks!

Should one use cudaGraphicsRegisterFlagsSurfaceLoadStore rather than cudaGraphicsRegisterFlagsWriteDiscard?

As discussed here, when writing to an OpenGL resource from CUDA, cudaGraphicsRegisterFlagsWriteDiscard can be used to indicate a one-way type of delivery.

This is (at least) applicable to writing to an OpenGL resource that has been exposed to CUDA as a linear buffer.

cudaGraphicsRegisterFlagsSurfaceLoadStore is used when writing to a surface (or reading from one), and such write and read activity always goes to a cudaArray, not a linear buffer.

The two cases are not necessarily the same. But I believe you can probably combine those two flags when doing writes (only) to a surface, although I don't know what effect it would have, if any.
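
Combining them would presumably look like this (an untested sketch reusing the names from the first post):

// untested: request surface load/store and hint that old contents are never read back
cudaGraphicsGLRegisterImage(&writeRes, writeTex, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsSurfaceLoadStore | cudaGraphicsRegisterFlagsWriteDiscard);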


One is for linear buffers, the other is not. So what will happen if they are combined?