Optix 4 and CUDA interop, new limitation with input/output buffers

thomasp · June 24, 2016, 10:06am

I used to allocate a large buffer of data with CUDA and forward pointer to Optix so it could write data in it. Something like this:

CUDA_PART.cpp :

cuda_buffer = ...; // create CUDA buffer
...
// use pointer arithmetic to make Optix work on a chunk
ray_tracer->compute(cuda_buffer+offset, chunksize);

OPTIX_RAY_TRACER.cpp

optix_buffer = createBufferForCUDA(RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL, RT_FORMAT_USER, 1);

RayTracer::compute(void * dData, int size)
{
   optix_buffer->setDevicePointer(deviceId, dData);
   ...
   optix launch
   ...
   // >>>> at this point, Optix has written data in memory allocated by cuda_buffer
}

With Optix 4, I can’t do this anymore (use an input/output buffer as target of cuda interop)

What are the suggested workaround ? Reading the programming guide it’s not clear to me. I feel like an interesting feature has been removed in Optix 4

MartinStich · June 29, 2016, 2:02pm

OptiX 4 is slightly stricter when it comes to CUDA interop, in order to make the API semantics more well defined (e.g. in the multi-GPU case). One visible change is that rtBufferSetDevicePointer no longer works on output buffers.

Note however that rtBufferGetDevicePointer does still work on output buffers. So if you can structure your application in a way that leaves memory allocation for these buffers to OptiX, then you should be able to implement the same use case in a way very similar to what you already have. One thing to keep in mind is that this function may return a zero-copy pointer (it usually will in the multi-GPU case).

All that being said, for gpu-local buffers like in your use case, we can probably make setPointer work. We’ll look into it.

Thanks for your feedback!

Martin

thomasp · June 29, 2016, 7:41pm

Yes I tried that. It’s only +1 GPU/GPU copy compared to OptiX 3 solution, but it seems acceptable

However I never managed to make it work, due to another bug/thing I failed to explain. Spent two days on this and had to give up.

OPTIX_RAY_TRACER.cpp (OptiX 4)

optix_buffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT, RT_FORMAT_USER, 1);

    RayTracer::compute(void * dData, int chunkSizeInBytes)
    {
       // not *real* code but it gives an idea
       ...
       context->launch(..);
       ...
       // copy optix_buffer back to cpu to debug its content
       int elementCount = (chunkSizeInBytes / sizeof(my::payload)); // elementCount ok as expected
       std::vector<my::payload> hVector(elementCount);
       memcpy(hVector.data(), optix_buffer->map(), chunkSizeInBytes);
       optix_buffer->unmap();
       
       // At this point, only the first height elements are present on CPU
       //
       // hVector[0] ok result
       // hVector[1] ok result
       // hVector[...]
       // hVector[7] ok result
       // hVector[8] 0-filled
       // hVector[9] 0-filled
       // hVector[...]
       // hVector[last] 0-filled
    }

This is only observable if optix_buffer is written directly in the raygen program.

If it’s written in closestHit program (as it is intended to be), hVector is completely 0-filled

I’ve recompiled all kernels with sm_35 on a GTX 750 Ti with latest drivers (cuda 6.5, Win7)

I should make a compilable example (and/or clean my code), but I don’t have time now. This bug is not reproduced in a second OptiX RayTracer used in the same app. The second engine is slightly simpler though

Anyway thank you for OptiX it’s a very nice tool

droettger · June 30, 2016, 7:26am

If there are inexplicable bahaviours like this, it’s normally worth checking individually if a newer display driver (=> CUDA driver) or a newer CUDA Toolkit (7.5 instead of 6.5 you’re using) is solving the issue. I would also recommend to check against the newest stable OptiX release 3.9.1.

thomasp · June 30, 2016, 9:02am

Ok, I already rolled back to 3.9.1 and it works as expected

I will try CUDA 7.5 when I can

MartinStich · July 12, 2016, 8:11am

FYI, we have added support for setting pointers on GPU_LOCAL buffers. It will be available in the upcoming OptiX 4.0 release.

Martin

thomasp · July 12, 2016, 9:11am

Great, thank you :)
When do you plan to release the next version ?

thomasp · July 21, 2016, 10:43am

Hi

I tried to narrow down my investigations, but to no avail
I just sent an OAC to optix-help mail, could you please take a look and let me know what is wrong ?

Thank you!

edit:
I forgot to say that using CUDA 7.5 (instead of CUDA 6.5), I had to set the cudaDeviceMapHost device flag to stop OptiX complain. The OptiX error message appeared at a buffer creation :

Invalid value (Details: Function “_rtBufferCreate” caught exception: CU_CTX_MAP_HOST is required for zero copy, but is disabled in the context. Fix context creation to include ‘CU_CTX_MAP_HOST’., file:C:\u\workspace\goldenrod-win64-build\sw\wsapps\raytracing\rtsdk\playpen\goldenrod\src\Device\CUDADevice.cpp, line: 147)

dlacewell · August 22, 2016, 7:38pm

We resolved some of this on optix-help, but I’ll summarize here.

The initial request about setDevicePointer for buffers marked (INPUT_OUTPUT | GPU_LOCAL) was resolved in the 4.0 release
The follow up bug report and trace about a buffer not being copied completely back to the host was an OptiX bug with 3d launches. We will fix this in the next release. Please work around it with a 2d launch where possible.

droettger · September 19, 2016, 10:20am

The OptiX 4.0.1 release is available now.
The 3D launch issue is one of the fixes mentioned in the release notes.

George_Liao · November 9, 2016, 1:05pm

Hey guys

May I ask a rookie question of Optix and CUDA interop here since it is very easy for you?

I am trying to set cudabuffer and make optix recognize that. Similar with the sample code thomasp showed here, but as the thomasp showed in CUDA_PART.cpp, it is .cpp file, so it does not seem to be cudaMalloc to creat cuda buffer . If it is not, may I ask how you create cuda buffer? OPTIX programming guide also does not say too much about this.

cuda_buffer = …; // create CUDA buffer

The omitted part.

Let me know, thanks a lot.

Best regards
George Liao

Keith_Morley · November 9, 2016, 5:01pm

Hello George. I am not sure I fully understand your question. Yes, you can use cudaMalloc here in this CPP file to allocate cuda memory and then call rtBufferSetDevicePointer with the resulting pointer. Something like:

float* d_ptr;
cudaMalloc( &d_ptr, sizeof(float)*num_float );
my_buffer->setDevicePointer( 0, d_ptr );

Perhaps you are not clear on the CUDA model of having a host-side API and a device-side CUDA-C. You might want to take a look at Section 3.1 of the CUDA C Programming guide for further discussion of what nvcc does and how it ties in with CUDA host code.

George_Liao · November 22, 2016, 4:17pm

Hey Kmorley

Thanks for your reply, I do not know how we can call cudaMalloc via c++ compiler? I thought we always need nvcc compiler to use this function. Can you tell me a little bit more about this? Thanks.

Best regards

Keith_Morley · November 22, 2016, 9:47pm

Yes, the cuda host-side API is compilable with standard host compilers.

Please take a look at the CUDA programming guide – particularly the section I referred to above.

George_Liao · November 24, 2016, 2:37pm

Hey kmorley

I made cudaMalloc work on cpp file already, but not together with optix.

Here is my sample code:

include <stdio.h>
//include <optix_world.h>

include <cuda_runtime.h>

int main()
{
//optix::Context context = optix::Context::create();
int *cuda_buffer;

cudaMalloc((void**)cuda_buffer, 5 * sizeof(int));
}

This works, main.cpp file, only call cudaMalloc.

include <stdio.h>
include <optix_world.h>

//include <cuda_runtime.h>

int main()
{
optix::Context context = optix::Context::create();
//int *cuda_buffer;

//cudaMalloc((void**)cuda_buffer, 5 * sizeof(int));
}

This also works, only use optix context.
But when I include them together, then the compiler complains.

include <stdio.h>
include <optix_world.h>

include <cuda_runtime.h>

int main()
{
optix::Context context = optix::Context::create();
int *cuda_buffer;

cudaMalloc((void**)cuda_buffer, 5 * sizeof(int));
}

In file included from /home/george/Work/software_projects/physics_based_radar/pb_optix_buffer_test/main.cpp:4:
In file included from /usr/local/cuda/include/cuda_runtime.h:90:
In file included from /usr/local/cuda/include/channel_descriptor.h:62:
/usr/local/cuda/include/cuda_runtime_api.h:252:17: error: unknown type name ‘cudaError_t’; did you mean
‘optix::cudaError_t’?
extern host cudaError_t CUDARTAPI cudaDeviceReset(void);
^~~~~~~~~~~
optix::cudaError_t
/usr/local/cuda/include/driver_types.h:1477:43: note: ‘optix::cudaError_t’ declared here
typedef device_builtin enum cudaError cudaError_t;
^
In file included from /home/george/Work/software_projects/physics_based_radar/pb_optix_buffer_test/main.cpp:4:
In file included from /usr/local/cuda/include/cuda_runtime.h:90:
In file included from /usr/local/cuda/include/channel_descriptor.h:62:
/usr/local/cuda/include/cuda_runtime_api.h:269:36: error: unknown type name ‘cudaError_t’; did you mean
‘optix::cudaError_t’?

So it looks like optix_world.h and cuda_runtime.h can not be included both.
If they can not, then how should I use both optix buffer and cuda buffer in the same file?
like this post showed,
I think I am missing some details here.
If you know more about this, please help me out. Thanks.

Best regards
George Liao