Using Texture memory

OmSp · June 11, 2017, 7:45am

My application is split into two parts:

The first does all the raytracing and stores its result into buffers.
Then, to ensure all threads from the previous launch are synched, a second, separate, entry point does some post-processing on the buffers.

As far as I’m aware, the post processing in itself has coherent memory access. However, since every pixel also looks at its neighbours, in practice every buffer entry is read several (non-simultaneous) times by several different threads. This, in addition to the fact that I also look at “fractional” neighbours (as opposed to whole index offsets), suggests that using Texture memory might be a lot more efficient than accessing the buffer from global memory (both for speed and for the built-in interpolation).

So my actual question is twofold:

When declaring an RTtexturesampler and then setting an RTbuffer to it, am I guaranteed this data will reside in texture memory and not in global?
In host code, after the first launch is completed and before the second is called, I need to copy the data from the original buffer to the texturesampler’s buffer. Is there a way to do this without going through the CPU (which I assume is what map(), memcopy() and unmap() would do)? Perhaps something using cudaMemcpyDeviceToDevice?

Keith_Morley · June 12, 2017, 4:32pm

This is a very interesting question. Yes, you might get a little better perf using texture memory – and I am assuming that these buffers are all read-only. However, I cant think of a way to avoid the device-to-device memcpy. I am guessing this memcpy is likely to offset any benefits from hardware interpolation and texture bricking.

Answers to your specific questions:

For non-trivial cases (eg, 1x1 pixel), optix will put the data into texture memory.
Yes, you can use cuda-interop (and cudaMemcpy) or gl-interop (and glTexImage2d) for this purpose. However, I believe either of these will require a dev2dev copy.

OmSp · June 14, 2017, 9:03am

I guess the only way to know if it’s worth it or not is to try and compare.

After reading through the docs and the samples though, I’m still a bit confused about the interop usage.

Would something along the following lines do the trick then?

// Get device ordinal (single GPU case, this is only testing code)
unsigned int numOptixDevices = m_Context->getEnabledDeviceCount();
std::vector<int> devices = m_Context->getEnabledDevices();
int optixDeviceOrdinal;
m_Context->getDeviceAttribute( devices[0], RT_DEVICE_ATTRIBUTE_CUDA_DEVICE_ORDINAL, sizeof( optixDeviceOrdinal ), &optixDeviceOrdinal );

// Get CUDA buffer device pointers		
Buffer colorBuffer = m_Context["COLOR_BUFFER"]->getBuffer();  // Filled by the previous optix launch
void* pColorBDP;
RTresult res = rtBufferGetDevicePointer( colorBuffer->get(), optixDeviceOrdinal, pColorBDP );
if ( RT_SUCCESS != res )
	return false;
		
TextureSampler sampler = m_Context->createTextureSampler();
Buffer textureBuffer = m_Context->createBuffer( RT_BUFFER_INPUT, RT_FORMAT_UNSIGNED_BYTE4, width, height );  // make sure format/dims match colorBuffer
sampler->setBuffer( textureBuffer );
void* pTextureBDP;
RTresult res = rtBufferGetDevicePointer( textureBuffer->get(), optixDeviceOrdinal, pTextureBDP );
if ( RT_SUCCESS != res )
	return false;

// Perform CUDA memcpy
cudaSetDevice( optixDeviceOrdinal );
size_t widthInBytes = width*sizeof( uint4 );
cudaError_t lastError = cudaMemcpy2D( pTextureBDP, widthInBytes, pColorBDP, widthInBytes, widthInBytes, height, cudaMemcpyDeviceToDevice );
if ( cudaSuccess != lastError )
	return false;

Is there any cleanup I need to do afterwards?
If I reach this code segment a second time for another simulation, do I need to call cudaSetDevice again?

-Thanks!

Keith_Morley · June 15, 2017, 4:27pm

Does this code run without errors? I would expect the 2nd buffergetdeviceptr call to fail as you are trying to get a deviceptr from a texture-backed buffer – which I believe optix disallows. If optix does allow this, then your approach should work. Yes, you should set the cuda device each time you execute the memcpy.

What I had in mind was using GL-interop or CUDA-interop to apply a CUDA or GL post-process. For instance, you could:

create a GL buffer object (glCreateBuffer)
bind it to an optix buffer (rtBufferCreateFromGLBO)
render output to this buffer in optix
after optix launch assign the GL buffer to texture
** glBindBuffer( GL_PIXEL_UNPACK_BUFFER, glbo_id) and glTexImage2d
Use GL to do your post processing with this gl texture

OptiX isnt really well setup for this type of buffer-to-texture transfer. If your above code works let me know. If I get the time, I will try to experiment with such a setup myself.

OmSp · June 18, 2017, 9:23am

Of course that code segment doesn’t work, hence my posting it to begin with ;-)

It might be worth noting that moving the line “sampler->setBuffer” to after the copy operation doesnt throw any errors, I just silently get a blank buffer output instead.

When I naively tried attaching the buffer to the texture, I forgot to copy the error thrown, but it went something along the lines of “Cannot bind a variable to the texture”. So I tried another approach: “context->removeVariable”, and only then attaching, hoping that the device2device memcopy will be handled in the background. Again, no execution errors, but the output buffer was blank.

I’m really trying to avoid adding more dependencies to an already bloated application, so I’d prefer to stay away from openGL. Given your previous comment that you doubt if there will be an overall gain in performance, I’m inclined to stop fighting against something that the user is clearly blocked from doing.

mapic · March 10, 2019, 7:24pm

Hi OmSp,

I’m facing a similar situation like in your question 2). I have a TextureSampler that needs to be updated before rendering every frame. The map() → memcopy() → unmap() ruins all the performance. I was successful getting buffer’s data copied using cudaMemcpyDeviceToDevice, but not if they are bound to a TextureSampler.

I think Keith_Morley is right, I tested it and optix didn’t allow me to get the device pointer of the buffer when it was previously bound to a TextureSampler. I wonder if you have found how to overcome this issue, or if there is another approach to accomplish that.

Thanks!

m001 · March 11, 2019, 5:17am

@mapic
Try to create each time a new TextureSampler and a new buffer
(if its possible; If not: yet I think there’s no other way than copying through CPU map/unmap).

If possible to recreate the texturesampler, also update the texture_id variable in the OptiX program. And to avoid memory leaks first getBuffer from the TextureSampler, use ->destroy on the texturesampler and then use->destroy() on the old buffer. Maybe this permanent destroy and recreate of buffers+texturesamplers (if it really works) is faster than the GPU=>CPU=>GPU host copy.

I tried to copy the new texture data to an each time new created optix::Buffer (RT_BUFFER_INPUT) through a CUDA kernel (on CUDA 10, OptiX 6.0.0)
(=> for copying I used the device pointer of the buffer as parameter for the CUDA kernel as shown in
CUDA Samples\v10.0\2_Graphics\simpleD3D11Texture,
=> there are no CUDA errors or exception, but the result is not in the texture sampler.

I also tried: cudaMemcpyDeviceToDevice for copying to that new buffer (as you already did); but this does not work when the source is of type RT_BUFFER_INPUT) :

Buffer outputbuffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT,  RT_FORMAT_FLOAT4, width, height);

void* buffer_DevPtr = (float*)outputbuffer->getDevicePointer(0);
optix::Buffer test_buffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT4, width, height);
test_buffer_DevPtr = (float*)test_buffer->getDevicePointer(0);
cudaMemcpy(test_buffer_DevPtr, buffer_DevPtr, sizeof(float) * 4 * width * height, cudaMemcpyDeviceToDevice);

buffer_DevPtr=406c00000h
test_buffer Dev Ptr=401000000h
CUDA error : cudaMemcpy cudaMemcpyDeviceToDevice failed : (11) invalid argument.

this succeeds, if test_buffer is created with RT_BUFFER_INPUT_OUTPUT

But for a buffer attached to a texture sampler you need RT_BUFFER_INPUT

So maybe you can try to use a pure optix::Buffer instead of some texture sampler (in cases where you simply only need the pixel data without the benefits of the texture sampler)

mapic · March 11, 2019, 10:08pm

Hi m1,

your approach is really interesting, if the TextureSampler creation/destruction is fast enough, this might work!

As I said, I successfully copied data from one buffer to another using cudaMemcpy, and in my case the destination buffer was declared as RT_BUFFER_INPUT. To inspect the values, I wrote a small RTProgram with a rtPrintf to display some elements of the buffer, and the retrieved values were correct. I tested it with RT_FORMAT_FLOAT, and I have just quickly tested it with RT_FORMAT_FLOAT2.

I also made a first attempt to implement your solution, but I have almost no spare time during the week. Hopefully, this weekend will have more time. I achieved to destroy and create the sampler, but the interpolation is giving me 0.0 everywhere.

I will be doing more research. If you want to mess around as well, I can post the piece of code I’m using to copy data. At least, it would be interesting if you can reproduce the successful results when copying data to a RT_BUFFER_INPUT.

And you are right, if everithing fails, there is always the option to use Buffers instead of TextureSamplers, handcoding the interpolation method, altough TextureSampler would be faster I suppose.

Thanks a lot for your help, I really appreciate!

m001 · March 12, 2019, 12:15am

Hi mapic,

Yes that piece of code would be really interesting. I’d really like to try to reproduce it.

mapic · March 12, 2019, 9:05pm

Hi again,

First of all, I’m using Optix 5.1 and CUDA 9.1, but also tried Optix 6.0 with a similar test. I took a look to the release notes of the 6.0 and I think there is no much difference in terms of what we are trying to achive.

This is the piece of code I used in my test. It worked fine, but this is the previous step before bringing the TextureSampler into play.

// Declare buffers (note the type is float2 and the second one is RT_BUFFER_INPUT)
bufferOrigin = m_context->createBuffer(RT_BUFFER_INPUT_OUTPUT, RT_FORMAT_FLOAT2, 2, 2);
bufferDestination = m_context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT2, 2, 2);
m_context["bufferOrigin"]->set(bufferOrigin);
m_context["bufferDestination"]->set(bufferDestination);

// Populate data in the first buffer
optix::float2* data = (optix::float2*)(bufferOrigin->map());
for (int i = 0; i < 4; i++)
	data[i] = make_float2(i,i);
bufferOrigin->unmap();

// First call to an Optix Program (this is important because the data we have just wrote to the buffer
// with the map() / unmap() won't be transfered to de GPU until we call an Optix Progeam for the first time,
// so it is pointless to do a cudaMemcpy yet since the data is not in the GPU)
m_context->launch(m_printProgramIndex, 100, 100);

// Now we get the pointers
cudaSetDevice(m_idGpuCuda);
float2* bufferOriginDevicePtr = (float2*)(bufferOrigin->getDevicePointer(m_idGpuCuda));
float2* bufferDestinationDevicePtr = (float2*)((bufferDestination)->getDevicePointer(m_idGpuCuda));

// dev2dev copy
cudaError_t lastError = cudaMemcpy(bufferDestinationDevicePtr, bufferOriginDevicePtr, sizeof(float2) * 4, cudaMemcpyDeviceToDevice);

// call to the print program (I do it this way since the buffer is RT_BUFFER_INPUT, so it is not possible
// to get the values back to the CPU)
m_context->launch(m_printProgramIndex, 100, 100);

The RT_Program I use to print buffer values from the GPU is this:

rtBuffer<optix::float2, 2>            bufferOrigin;
rtBuffer<optix::float2, 2>            bufferDestination;

RT_PROGRAM void printProgram() {
rtPrintf("Buffer origin at [1,1] : x =  %f , y = %f  \n", bufferOrigin[make_uint2(1, 1)].x, bufferOrigin[make_uint2(1, 1)].y);
rtPrintf("Buffer destination at [1,1] : x = %f , y = %f  \n", bufferDestination[make_uint2(1, 1)].x, bufferDestination[make_uint2(1, 1)].y);
}

After running the code, console output is this:

Buffer origin at [1,1] : x = 3.000000 , y = 3.000000
Buffer destination at [1,1] : x = 0.000000 , y = 0.000000
Buffer origin at [1,1] : x = 3.000000 , y = 3.000000
Buffer destination at [1,1] : x = 3.000000 , y = 3.000000

Lines 1 and 2 belong to the first call to the RT_PROGRAM, before the cudaMemcpy. The result is the expected, if you look how the data is populated, the destination buffer is 0.0f since it hasn’t been populated with data yet. Lines 3 and 4 are the interesting ones, after the cudaMemcpy. As you can see, the destination buffer holds the correct value now.

Please, tell me if you can reproduce this. As I said, I’m quite limited in time until friday, so I’m sorry if I don’t replay as soon as I wolud like to do it ;) .

m001 · March 13, 2019, 5:58am

Great!

Yes, I can confirm, that after running an OptiX Kernel which accesses only an OptiX source buffer (as you described), the dev2dev copy also succeeds on my system and the values are valid (using float4 buffers dimension 1024x768; CUDA 10.0; OptiX 6.0.0)

I initialized the first float of all the float4’s of the source optix::buffer (type RT_BUFFER_INPUT_OUTPUT) with sin(((float)x / (float)width) * M_PI) ;
then ran an OptiX kernel which has a read access to that buffer (and copies that data to another buffer; that other buffer is later ignored; but without it nvcc may compeltely optimize the memory access away, cause it would be senseless)
then I do the dev2dev as you said (target optix::buffer type: RT_BUFFER_INPUT).
and then that optix::Buffer (type RT_BUFFER_INPUT) is copied through a CUDA kernel into a DirectX11 StructuredBuffer;
The result then is read by a HLSL shader as a SRV and the pixel shader then outputs the color in the blue channel. No errors, no exceptions. So far so good. (see attachment)

But:
Attaching this RT_BUFFER_INPUT optix::Buffer to an optix::TextureSampler fails:
OptiX Error: ‘Invalid value (Details: Function “_rtTextureSamplerSetBuffer” caught exception: Buffers used as both buffer and texture are not supported)’

And so its clear: in the current OptiX versions a buffer either can be used as Buffer xor as an attachment in a TextureSampler.
So unfortunately then an each-frame-updated-texture only can be updated/accessed through a pure buffer yet. (Or over the host GPU=>CPU=>GPU)

mapic · March 13, 2019, 8:20am

Nice you got it work!

I have never used DirectX, so I don’t know how you do the second part of your test (altough I understand the overall idea). It would be nice if you can post the code, so I can learn and get a better idea.

I got this error as well, which is a known Optix limitation, and it comes to say that you can’t use the same buffer in a TextureSampler and as a rtBuffer in an Optix Program. However, as the final goal is to use the data as a TextureSampler and not as a buffer, there is no need to declare it as rtBuffer. In the code I posted, you should remove this line

rtBuffer<optix::float2, 2>            bufferDestination;

(and the corresponding rtPrintf and the m_context[“bufferDestination”]->set(bufferDestination);). If you do it, you will be able to create the TextureSampler, but the next error you will get is when you try to get de the device pointer of the buffer, if you have previously bound it to the TextureSampler. So here is where your idea (create a new TextureSampler every time) may come in handy. The key is to do the cudaMemcpy before binding the buffer to the sampler in some way, but this is still unknown.

m001 · March 13, 2019, 9:35pm

The rtBuffer decleration in my OptiX program already was removed; But now I also removed the corresponing context[“bufferDestination”]->set(…) call.
The TextureSampler now accepts that buffer (of type RT_BUFFER_INPUT) in that case also on my system without error.
However, the values in it are not valid (output is black).
And on closing my app “Critical error detected c0000374” occurs…
So using the buffers this way seems to be an undefined behaviour. I will discard further tests on this for now due to time restrictions.

Basically for copying the optix::Buffer to a DirectX StrucutredBuffer/Texture I use the technique shown in this sample:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.0\2_Graphics\simpleD3D11Texture

for the DirectX part please look at:
[url]https://docs.microsoft.com/en-us/windows/desktop/direct3dgetstarted/getting-started-with-a-directx-game[/url]
Sample Code: [url]https://github.com/walbourn/directx-sdk-samples/tree/master/SimpleSample11[/url]
NOTE: DirectX is Windows-Only.