Yes, you got it right, I’m using OptiX Prime. Sorry that I didn’t mention Prime.
I should also introduce our project. It is a thermal solution for satellites. To get an idea of it, please see our poster “GPU Accelerated Spacecraft Thermal Analysis” from GTC.
The project source code can be found on Bitbucket.
Separating the individual float components of a float3 vector into different non-interleaved float arrays doesn’t make sense though.
That will ruin the memory accesses when gathering the individual floats compared to reading a float3, which are both read the same way but latter is more often in the same cache line.
I have to disagree. Code runs in warps of 32 threads, and coalesced memory access is super efficient: 32 threads x 4 bytes per float = 128 bytes, which is exactly one cache line (global memory access line).
Maybe reading 2xfloat3 is good when there is strong thread divergence. And maybe ray tracing of (random) rays is such a case.
But there is other code around the ray tracing. In my case ray-tracing time is 33% of the pipeline. That is why I want my code to work super efficiently.
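To make the coalescing arithmetic above concrete, here is a minimal host-side C++ sketch that counts how many 128-byte lines one warp touches with a single load instruction. It is a simplified model, not a measurement: it assumes a 128-byte line and a 32-thread warp as in the text, and it ignores 32-byte sectors and L1/L2 caching. The function name `lines_touched` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct cache lines touched when each of `warp_size` threads
// loads `elem_bytes` bytes at address base + thread * stride_bytes.
// Simplified model: one load instruction, no sectors, no cache reuse.
std::size_t lines_touched(std::size_t stride_bytes, std::size_t elem_bytes,
                          std::size_t base = 0, std::size_t warp_size = 32,
                          std::size_t line_bytes = 128) {
    assert(elem_bytes > 0 && line_bytes > 0);
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < warp_size; ++t) {
        std::size_t first = base + t * stride_bytes;  // first byte read by thread t
        std::size_t last = first + elem_bytes - 1;    // last byte read by thread t
        for (std::size_t l = first / line_bytes; l <= last / line_bytes; ++l)
            lines.insert(l);
    }
    return lines.size();
}
```

In this model `lines_touched(4, 4)` (one float per thread from a separate array) gives 1 line, while `lines_touched(24, 4)` (one float component out of a 24-byte ray record) gives 6 lines. The six component loads of a 24-byte record all fall into the same 6 lines, so the total traffic is similar only if those lines stay in cache between the loads.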
The cache argument also holds if the two float3 for ray origin and ray direction, which are both needed at the same time to build a ray, lie next to each other.
2xfloat3 is not aligned to the 128-byte cache line. Coalesced memory access is impossible because the 24-byte records end up unaligned. How does all that work with the 32 threads of a warp?
In my case even simple algorithms like filtering rays can’t utilize more than 50-60% of the memory bandwidth. That is why I suspect that 2xfloat3 is the issue.
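The alignment question can be checked with a little arithmetic. The sketch below is host-side C++ under the same assumptions as the text (32 threads per warp, 128-byte lines, contiguous 24-byte ray records); the helper names are hypothetical and nothing here is measured on a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Total bytes covered by one warp reading contiguous records.
std::size_t warp_span_bytes(std::size_t record_bytes, std::size_t warp_size = 32) {
    return record_bytes * warp_size;
}

// Number of records within one warp that cross a cache-line boundary.
std::size_t straddling_records(std::size_t record_bytes, std::size_t base = 0,
                               std::size_t warp_size = 32,
                               std::size_t line_bytes = 128) {
    assert(record_bytes > 0 && line_bytes > 0);
    std::size_t count = 0;
    for (std::size_t t = 0; t < warp_size; ++t) {
        std::size_t first = base + t * record_bytes;  // first byte of record t
        std::size_t last = first + record_bytes - 1;  // last byte of record t
        if (first / line_bytes != last / line_bytes) ++count;
    }
    return count;
}
```

With 24-byte records a warp covers 768 bytes, which is exactly 6 lines, so every warp starts on a line boundary if the buffer does. Inside the warp, 4 of the 32 records straddle two lines, whereas 32-byte records never do. Whether that explains the 50-60% figure would still need profiling.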
From your question you seem to be using OptiX Prime and spend most of the time generating rays and handling the hit results?
Yes.
You do all that with CUDA on the GPU?
Yes. Almost everything.
Are you using asynchronous intersection queries?
No. A fixed synchronous pipeline significantly simplifies the code and architecture.
But I’m using async queries for scene updates.
Do you have multiple queries in flight to work in parallel?
No. I’m going to implement multiple queries using multiple GPUs, simply replicating every workflow step on every GPU.
If all that processing takes too long, maybe it makes sense to use OptiX and leave the parallelization to that.
I moved almost everything to the GPU when one is available. GPU utilization is ~98% and power consumption is about 70%.
You have control over the ray generation, ray tracing, and hit event processing (closest hit and any hit) in there and the programmable any_hit domain allows ray continuation.
Means you could possibly handle your whole algorithm in a single launch.
TL;DR: that would be too serious a vendor lock-in ; )
Is it possible to integrate anything else, such as conductivity or static heat sources (engines, electronics), into that ray-tracing workflow? Actually, I’m not sure it would be a good idea anyway. Consider that the application should also work on a bare CPU.
Conductivity is a graph-solving problem. It is pretty efficient on the CPU, but I’m not sure that I can implement it properly on the GPU.
Also when using the high level OptiX API you can structure your buffers as you like.
You could for example put your attributes into individual buffers, or use one buffer and put them into a structure-of-arrays format.
Sorry, I don’t get that.
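For reference, the two layouts mentioned in the quote could be sketched as follows. This is plain host-side C++ with hypothetical type names (`RayAoS`, `RaysSoA`, `to_soa`); it only illustrates the memory layouts and is not OptiX or OptiX Prime API code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: one 24-byte record per ray (origin + direction),
// i.e. the 2 x float3 layout discussed above.
struct RayAoS {
    float ox, oy, oz;  // ray origin
    float dx, dy, dz;  // ray direction
};
static_assert(sizeof(RayAoS) == 24, "expected a tightly packed 2 x float3 record");

// Structure of arrays: one contiguous float array per component.
struct RaysSoA {
    std::vector<float> ox, oy, oz, dx, dy, dz;
};

// Convert an AoS ray buffer into the SoA layout.
RaysSoA to_soa(const std::vector<RayAoS>& rays) {
    RaysSoA out;
    for (const RayAoS& r : rays) {
        out.ox.push_back(r.ox); out.oy.push_back(r.oy); out.oz.push_back(r.oz);
        out.dx.push_back(r.dx); out.dy.push_back(r.dy); out.dz.push_back(r.dz);
    }
    assert(out.ox.size() == rays.size());
    return out;
}
```

In the SoA form, thread `i` reads `ox[i]`, `oy[i]`, and so on from separate arrays; in the AoS form it reads all six floats of `rays[i]` from one record.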
If you suspect that the loading of the data is a performance problem, you could use float4 instead, to get the vectorized load operation. Though because ray tracing structures are memory intense, it makes sense to save memory. Means what works best or at all depends on the scene size and underlying hardware.
Yes, padding to float4 may improve performance. Yet I would need an additional step and additional memory to convert float4 rays into float3 rays for Prime. And this is an issue.
Another problem is the padding itself. The best algorithms (those with little computation) will be 25% slower because of the lost bandwidth.
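The padding cost and the extra conversion step mentioned above can be sketched in host-side C++. The types `Float3` and `Float4` and both function names are stand-ins invented for this example (they are not the CUDA vector types), and the arithmetic assumes a purely bandwidth-bound kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };     // 12 bytes of payload
struct Float4 { float x, y, z, w; };  // 16 bytes, w is padding

// Fraction of the transferred bytes that is padding.
inline double padding_fraction(std::size_t payload_bytes, std::size_t padded_bytes) {
    assert(padded_bytes >= payload_bytes && padded_bytes > 0);
    return static_cast<double>(padded_bytes - payload_bytes) /
           static_cast<double>(padded_bytes);
}

// The extra conversion step: copy padded float4 data into a packed float3
// buffer. Note that it needs a second buffer.
std::vector<Float3> strip_padding(const std::vector<Float4>& in) {
    std::vector<Float3> out;
    out.reserve(in.size());
    for (const Float4& v : in) out.push_back(Float3{v.x, v.y, v.z});
    return out;
}
```

`padding_fraction(sizeof(Float3), sizeof(Float4))` evaluates to 0.25: a quarter of the bytes moved carry no data, which is the same as moving one third more bytes than the packed layout.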
I’m normally using an array of structures for my vertex data and that was only 2.5% slower than a structure of arrays with the same data in one test, but is much more convenient to work with.
Really? I have reverted my “optimized” CUDA code so many times that I can believe that. The main lesson I learned with CUDA: just write simple and robust code, and it will work efficiently (or eventually work efficiently on new compute capabilities ; )
When working with OptiX Prime you’re forced to adhere to the built-in data structures for the query and hit buffers. With OptiX you can do what you like.
Sorry, but I don’t believe in magic : )
Am I right that OptiX works on top of OptiX Prime? Then it would perform that data conversion implicitly.
Let me summarize:
- First, thank you for the detailed response!
- It is interesting that struct of arrays is just 2.5% faster with OptiX. Maybe this is because of the implicit conversion to 2xfloat3 and back?
- I don’t understand why float3 memory access is not coalesced but still OK. Maybe massive parallelism, multiple scheduled blocks, caching, and intensive computation smooth out the memory access problems.
- I admit that due to high thread divergence or some other reasons 2xfloat3 may be better for ray-tracing than struct of arrays.
- My profiling shows that I can’t get high memory bandwidth utilization with float3 data format (for faces and rays) even for computationally simple algorithms.
- Padding to float4 may improve performance, but may also make it worse (due to the 25% bandwidth loss). And I would need an additional conversion step and additional memory anyway.