Variable length ray payload

I’d like to create a ray payload containing an array whose length is unknown at compile time. The array’s length is uniform within a launch, but could change from one launch to the next.

Ordinarily, I’d use malloc, but the OptiX documentation indicates that will not work. Any suggestions for how to implement this?

I’d like something nicer than just creating an array of the maximum possible length. The array could be very long (2000 float entries) but will usually be much smaller (around 100 float entries).

The first thing that came to my mind: How about using a pointer to a global, preallocated rtBuffer?

You could either store the pointer in the ray payload or access the buffer using the launch index.
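As a rough illustration of the launch-index variant, the sketch below computes where one ray's slice would start in a flat, preallocated float buffer. It is plain C++ standing in for the arithmetic a device program would do; `sliceOffset`, `entriesPerRay`, and the row-major layout are assumptions made for the example, not anything OptiX prescribes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: offset of the first float belonging to launch index (x, y)
// in a flat scratch buffer holding width * height * entriesPerRay floats.
// entriesPerRay is the array length chosen for this launch.
std::size_t sliceOffset(std::size_t x, std::size_t y,
                        std::size_t width, std::size_t entriesPerRay)
{
    assert(x < width);                        // x must lie inside the launch width
    return (y * width + x) * entriesPerRay;   // row-major pixel index times slice size
}
```

Because the slice size is an ordinary runtime value, the buffer can be resized between launches without recompiling anything.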

Aside from that:
What problem are you actually trying to solve? Do the calculations that produce those up to 2000 float entries really have to take place in the hit program?

This sounds like a start. Unfortunately, my ray tracer is recursive, not iterative, so I have to plan for there to be more room in my global rtBuffer than just for my primary rays. I don’t know how many reflected rays there will be initially, so I’d have to take a guess and give myself an artificial limit for the number of rays to store arrays for.
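To get a feel for what such an artificial limit costs, here is a small sizing sketch. It assumes one slice per pixel per recursion level (so no branching into several rays per level), and `scratchBytes` and `maxDepth` are hypothetical names introduced only for this estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed when every pixel gets one float slice per recursion level.
// maxDepth is the assumed cap on recursion depth.
std::uint64_t scratchBytes(std::uint64_t width, std::uint64_t height,
                           std::uint64_t maxDepth, std::uint64_t entriesPerRay)
{
    assert(maxDepth > 0);  // at least the primary ray needs a slice
    return width * height * maxDepth * entriesPerRay * sizeof(float);
}
```

With 4-byte floats, an assumed 1024 x 768 launch capped at depth 4 comes to about 1.2 GB at 100 entries per ray and about 25 GB at 2000 entries, so the cap and the array length both matter a great deal.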

What sort of performance hit do I get for using global memory?

I’m keeping track of the fraction of total illumination contributed by each source in the scene. Since the number of sources isn’t known at compile time, the size of the array is variable. The hit program computes the illumination provided by the intersected source, so my thought is that it would also be the logical place to populate the array. If I wait until later, I no longer have a way to differentiate diffuse reflections from different sources.
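The bookkeeping described here might look roughly like the following sketch. It uses a `std::vector` in place of a buffer slice, and `accumulate` and `fractions` are hypothetical names; the real hit program and its shading math are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds one hit's contribution to the slot of the source that produced it.
void accumulate(std::vector<float>& perSource, std::size_t sourceIndex, float contribution)
{
    assert(sourceIndex < perSource.size());
    perSource[sourceIndex] += contribution;
}

// Converts accumulated contributions into fractions of the total illumination.
// Returns all zeros when nothing has been accumulated.
std::vector<float> fractions(const std::vector<float>& perSource)
{
    float total = 0.0f;
    for (float v : perSource) total += v;

    std::vector<float> out(perSource.size(), 0.0f);
    if (total > 0.0f) {
        for (std::size_t i = 0; i < perSource.size(); ++i)
            out[i] = perSource[i] / total;
    }
    return out;
}
```

The array length is just the number of sources passed in at run time, which is why it can change from launch to launch.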

Alternatively, I could run 2000 separate launches, each with only one source, but I’m hoping this is faster.

I guess you have to benchmark that, especially in comparison to having an enormous OptiX stack size when using such a huge ray payload recursively.

Is it possible for you to change to an iterative algorithm? I did this with my path tracer, which helped both performance and memory management.

An iterative path tracer would have better performance, but the accuracy of the results would suffer. That’s the main consideration.

Why does recursion vs. iteration affect accuracy?

I tried m_sch’s suggestion of creating a preallocated scratch space rtBuffer to store my large arrays. My rtBuffer has type (RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL). It works quite well when all access to the array is from within the same rtProgram. However, when I try to write to the array from a second rtProgram, either using my rtLaunchIndex to identify my element of the array, or by passing a pointer into the array through my ray payload, I get the following error:

Unknown error (Details: Function "_rtContextLaunch2D" caught exception: Encountered a CUDA error: result returned (700): Unknown, [6619204])

My system: 2x Tesla K40, Windows 7, CUDA 6.5, OptiX 3.8 beta, driver 341.44

Do you have an SSCCE (http://sscce.org/) which reproduces the error?

I’ve solved the issue by taking an approach where the pointer into the rtBuffer is recalculated based on the rtLaunchIndex every time it is needed. Apparently any sharing of pointers into global memory between rtPrograms is enough to cause an error.
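One way to picture this workaround: the payload carries only small values such as the recursion depth, and every program rebuilds its buffer offset from the launch index. The sketch below is plain C++ under that assumption; `Payload`, `depthSliceOffset`, and the per-depth layout are hypothetical and not taken from the poster's code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical payload: carries the recursion depth instead of a pointer.
struct Payload {
    unsigned depth;
};

// Recomputes the slice offset from the launch index and the payload's depth.
// Assumed layout: maxDepth consecutive slices per pixel, pixels in row-major order.
std::size_t depthSliceOffset(std::size_t x, std::size_t y, std::size_t width,
                             const Payload& payload, unsigned maxDepth,
                             std::size_t entriesPerRay)
{
    assert(x < width);
    assert(payload.depth < maxDepth);  // deeper rays would have no slice
    const std::size_t pixel = y * width + x;
    return (pixel * maxDepth + payload.depth) * entriesPerRay;
}
```

Nothing address-like crosses a program boundary in this scheme, only integers that are meaningful on every device.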

You are using multiple GPUs, right? I guess this could be causing the problem in your case. Maybe you can try your original approach with just one GPU.

Also, if the buffer you’re actually reading back as the result on the host has the RT_BUFFER_GPU_LOCAL property, that won’t work either.
RT_BUFFER_GPU_LOCAL means the host can only write to the buffer, not read it back, and in multi-GPU configurations device writes to it are not coherent because each GPU has its own copy. Only use it for scratch buffers you need to read and write locally per device.

Calculating unique output buffer indices per actual launch index should work with any number of local GPUs.