Struct of vectors instead of vector of structs in Optix API
They say(*) that one should use struct of vectors instead of vector of structs for 3d-vector representation for better performance.

I mean, instead of:

[code]
typedef struct {
    float3 origin;
    float3 direction;
} *rays_t;
[/code]

I should use something like this:

[code]
typedef struct {
    float* ray_origin_x;
    float* ray_origin_y;
    float* ray_origin_z;
    float* ray_direction_x;
    float* ray_direction_y;
    float* ray_direction_z;
} rays_t;
[/code]

There are good reasons for that: for example, a float3 is not aligned to the GPU cache line, and so on.

But I don't understand why there is no such interface in NVIDIA OptiX. For example, there is only one ray definition format, RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION, which is float3:origin, float3:direction.

I'm using millions of rays: ray generation, ray tracing, and ray processing are 98% of my application workload; OptiX itself consumes ~33%. And it looks like I could significantly speed up *my* part of the code using that "struct of vectors" approach.

So, can we get an OptiX API with struct of vectors instead of vector of structs? Or am I mistaken and float3 is just good enough?

(*) It looks like common sense to me, yet I failed to find a link to a trusted source such as the Parallel Forall blog.

#1
Posted 12/14/2017 01:20 AM   
First, there are two APIs inside the OptiX SDK:
- The high-level OptiX ray casting API, with flexible scene graph capabilities, a single-ray programming model, and developer-programmable program domains.
- The low-level OptiX Prime API, which uses a ray wavefront programming model and is used only for intersection testing, on a simpler scene structure with instancing support and triangle primitives only.

The define RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION you cited is from OptiX Prime (RTP for Ray Tracing Prime). All these buffer formats are hardcoded in OptiX Prime and the core takes care to load them as efficiently as possible.

Separating the individual float components of a float3 vector into different non-interleaved float arrays doesn't make sense though. That will ruin the memory accesses when gathering the individual floats compared to reading a float3; both are read the same way, but the latter is more often in the same cache line.
The cache argument also holds if the two float3 for ray origin and ray direction, which are both needed at the same time to build a ray, lie next to each other.

From your question you seem to be using OptiX Prime and spend most of the time generating rays and handling the hit results?
You do all that with CUDA on the GPU?
Are you using asynchronous intersection queries?
Do you have multiple queries in flight to work in parallel?


If all that processing takes too long, maybe it makes sense to use OptiX and leave the parallelization to that. You have control over the ray generation, ray tracing, and hit event processing (closest hit and any hit) in there and the programmable any_hit domain allows ray continuation.
That means you could possibly handle your whole algorithm in a single launch.

Also when using the high level OptiX API you can structure your buffers as you like.
You could for example put your attributes into individual buffers, or use one buffer and put them into a structure-of-arrays format.

If you suspect that loading the data is a performance problem, you could use float4 instead to get the vectorized load operation. However, because ray tracing structures are memory-intensive, it also makes sense to save memory. What works best, or at all, depends on the scene size and the underlying hardware.

I normally use an array of structures for my vertex data; in one test that was only 2.5% slower than a structure of arrays holding the same data, but it is much more convenient to work with.
I also use just a single buffer for these. For OptiX I recommend reducing the number of buffers, because that's normally faster and needs fewer operator[] invocations to access the data.

When working with OptiX Prime you're forced to adhere to the built-in data structures for the query and hit buffers. With OptiX you can do what you like.
There is also an optixRaycasting example inside the OptiX SDK which shows how to use OptiX for intersections only similar to OptiX Prime.

#2
Posted 12/14/2017 09:07 AM   
Yes, you got it right, I'm using OptiX Prime. I'm sorry that I didn't mention Prime.

Also, I should introduce our project. It is a thermal solution for satellites. To get an idea of it, please see our GTC poster "GPU Accelerated Spacecraft Thermal Analysis": http://www.gputechconf.com/resources/poster-gallery/2017/cfd-and-cae


The project source code can be found here: https://bitbucket.org/radiosity/engine


>Separating the individual float components of a float3 vector into different non-interleaved float arrays doesn't make sense though.
>That will ruin the memory accesses when gathering the individual floats compared to reading a float3; both are read the same way, but the latter is more often in the same cache line.

I have to disagree. Code runs in warps of 32 threads, and coalesced memory access is super efficient: 32 threads × 4 bytes per float = exactly 128 bytes, one cache line (global memory access granularity).
Maybe if there is strong thread divergence, then reading 2×float3 is good, and maybe ray tracing of (random) rays is that case.
But there is other code around the ray tracing. In my case, ray tracing time is 33% of the pipeline. That is why I want my code to work super efficiently.

>The cache argument also holds if the two float3 for ray origin and ray direction, which are both needed at the same time to build a ray, lie next to each other.

2×float3 is not aligned to the 128-byte cache line. Coalesced memory access is impossible because the 24-byte records eventually become misaligned. How does all that work with 32 warp threads?
In my case, even simple algorithms like filtering rays can't utilize more than 50-60% of memory bandwidth. That is why I suppose that 2×float3 is an issue.

>From your question you seem to be using OptiX Prime and spend most of the time generating rays and handling the hit results?

Yes.

>You do all that with CUDA on the GPU?

Yes. Almost everything.

>Are you using asynchronous intersection queries?

No. A fixed synchronous pipeline significantly simplifies the code and architecture.
But I'm using async queries for scene update.

>Do you have multiple queries in flight to work in parallel?

No. I'm going to implement multiple queries using multiple GPUs, just replicating every workflow step to every GPU.

>If all that processing takes too long, maybe it makes sense to use OptiX and leave the parallelization to that.

I moved almost everything to the GPU where available. GPU utilization is ~98% and power consumption is about ~70%.

>You have control over the ray generation, ray tracing, and hit event processing (closest hit and any hit) in there and the programmable any_hit domain allows ray continuation.
>Means you could possibly handle your whole algorithm in a single launch.

TL;DR: it would be too serious a vendor lock-in ;)

Is it possible to integrate anything else, like conductivity or static heat sources (engines, electronics), into that ray tracing workflow? Actually, I'm not sure it would be a good idea anyway. Consider that the application should also work on a bare CPU.
Conductivity is a graph-solving problem. It is pretty efficient on the CPU, but I'm not sure I can implement it on the GPU the right way.

>Also when using the high level OptiX API you can structure your buffers as you like.
>You could for example put your attributes into individual buffers, or use one buffer and put them into a structure-of-arrays format.

Sorry, I don't get that.

>If you suspect that loading the data is a performance problem, you could use float4 instead to get the vectorized load operation. However, because ray tracing structures are memory-intensive, it also makes sense to save memory. What works best, or at all, depends on the scene size and the underlying hardware.

Yes, padding to float4 may improve performance. Yet I would need an additional step and extra memory to convert float4 rays into float3 rays for Prime, and that is an issue.
Another problem is the padding itself: the best algorithm (with a small amount of computation) would be 25% slower because of the lost bandwidth.

>I normally use an array of structures for my vertex data; in one test that was only 2.5% slower than a structure of arrays holding the same data, but it is much more convenient to work with.

Really? I have reverted my "optimized" CUDA code so many times that I can believe that. The main lesson I learned with CUDA: just write simple and robust code, and it will work efficiently (or will eventually work efficiently on new compute capabilities ;)

>When working with OptiX Prime you're forced to adhere to the built-in data structures for the query and hit buffers. With OptiX you can do what you like.

Sorry, but I don't believe in magic :)
Am I right that OptiX works on top of OptiX Prime? Then it would perform that data conversion implicitly.


Let me summarize:
1. First, thank you for the detailed response!
2. It is interesting that struct of arrays is only 2.5% faster with OptiX. Maybe this is because of implicit conversion to 2×float3 and back?
3. I don't understand why float3 memory access is not coalesced but still OK. Maybe massive parallelism, multiple scheduled blocks, caching, and intensive computation smooth out the memory access problems.
4. I admit that due to high thread divergence, or for some other reason, 2×float3 may be better for ray tracing than struct of arrays.
5. My profiling shows that I can't get high memory bandwidth utilization with the float3 data format (for faces and rays), even for computationally simple algorithms.
6. Padding to float4 may improve performance but may also make it worse (due to the 25% bandwidth loss), and I would need an additional conversion step and additional memory anyway.

#3
Posted 12/15/2017 09:50 PM   
Back from vacation.
OK, so you're using CUDA kernels highly optimized for memory accesses in the code around the actual OptiX Prime ray tracing. That wasn't apparent to me from the initial post, so I was slightly confused about the potential benefit of your structure of float arrays.

To 2.)
What I had been comparing was the order of vertex attributes in my OptiX renderer, not the rays.
Going from structures of arrays of float4 data for the four attributes I use (position, tangent, normal, texcoord) to an array of per-vertex structures with these fields made only a 2.5% difference in overall rendering performance in my highly divergent path tracer. Performance there is limited mostly by the traversal.

In OptiX Prime the query and hit result formats are hardcoded and there is no way to feed in your structure of arrays into that directly.
But, as said, the high-level OptiX API does allow you to structure input/output buffers as you like, and it has multiple program domains (ray generation, closest hit, any hit, miss, etc.) which are fully under your control by implementing the necessary CUDA code!

There is an OptiX example named optixRaycasting inside the OptiX SDK which shows how to use OptiX for ray intersection testing only, similar to an OptiX Prime ray wavefront use case, but with all the additional flexibility of the high level OptiX API (including ray continuation via any-hit programs, custom primitive intersections, more flexible scene graph, etc.).
The ray-generation program and buffer layouts in OptiX are freely programmable, which means you can implement a ray generation program which constructs the ray origins and directions from any buffer layout you desire, including your structure of float arrays. The ray-generation program would just copy the six floats from your structure of arrays before calling rtTrace() with the resulting ray.

The ray generation program in OptiX normally has perfect occupancy because all threads are running the same instructions. I would start by measuring an OptiX 1D launch for that linear memory layout.

The BVH traversal core of OptiX and OptiX Prime is the same when using Trbvh, so the resulting intersection performance can be expected to be in the same ballpark.

#4
Posted 01/08/2018 11:51 AM   
Hello again!

I got it. My initial question was malformed and made little sense:
1. It was about the OptiX API.
2. But OptiX accepts buffers of my choice.
3. And that was your initial point.

Sorry for the confusion :(

The OptiX interface looks pretty attractive, but I need to understand several things first.
So here are my current (final?) questions.

I. Is OptiX based on OptiX Prime under the hood?
I believe it should be. Then there would be an implicit conversion: "The ray-generation program would just copy the six floats from your structure of arrays before calling rtTrace() with the resulting ray".

But (in case OptiX is based on the OptiX Prime API) it would have to perform that conversion in batches through GPU global memory. In that case, I could perform such a conversion explicitly (in my simulation engine)
and amortize the conversion overhead using mini-batches in several threads, CUDA streams, whatever.

Another way is on-the-fly conversion, like the CUB fancy iterators: https://nvlabs.github.io/cub/classcub_1_1_transform_input_iterator.html

But I can't find any Prime API for such on-the-fly conversion. I guess this is because the API is C-like and does not support any kind of C++ templates or lambdas.

II. Is there any chance that the OptiX _Prime_ API will accept struct of arrays for ray buffers in the future?
Then I could get rid of the conversion (memory bandwidth) overhead.
And that would be a win! :)

#5
Posted 01/09/2018 09:20 PM   
I.) It's not based on OptiX Prime; it just shares some code. That's exactly the idea and advantage of the high-level OptiX API abstraction: it allows arbitrary internal implementations without changing the user API, and that has already happened a lot between the initial version 1.0 and the current version 5.0.

When trying this with the OptiX API, you provide the ray generation program and fully control the input and output buffer layouts and how and where in your application you want to build the ray query buffers.

OptiX has a single-ray programming model and abstracts any scheduling and some more CUDA specifics (grid and warps, synchronizations, shared memory, ballot instructions, and the like; see the caveats chapter inside the OptiX Programming Guide). While you do not have all CUDA hardware programming features available due to that abstraction, it makes the OptiX device code programming rather easy because you just use standard CUDA C++ for the individual OptiX program domains' code.

#6
Posted 01/10/2018 05:38 PM   