OptiX-based collider performance

Hello,

I am writing a continuous collision detection algorithm based on the
OptiX ray tracer. Each particle and each vertex of the collider has a
start point and a constant velocity over a time step dt. The collider is a
triangular mesh, so I want to detect the barycentric coordinates and the
moment of collision between the segment (the path of a particle) and the
moving triangle.

The algorithm is simple. I have a ray generation program that works
pretty much like a regular ray tracer.
The result buffer format is float4, containing the moment of collision,
the triangle id, and the barycentric coordinates alpha and beta.

The bounding box program considers both the start and end points of
the vertices of each triangle, so the resulting AABB is much bigger than
in the static case.
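
Roughly, the swept bounds program looks like this sketch (a simplified illustration, not the exact production code; colliderEndPoints is a placeholder name for the end-of-step vertex positions):

rtBuffer<uint3, 1>  colliderVertices;
rtBuffer<float3, 1> colliderStartPoints;
rtBuffer<float3, 1> colliderEndPoints;   // placeholder: vertex positions at the end of the time step

RT_PROGRAM void sweptBounds(int primId, float result[6])
{
    const uint3 vid = colliderVertices[primId];
    optix::Aabb* aabb = (optix::Aabb*)result;
    aabb->invalidate();
    // The box has to enclose the triangle at both ends of the time step.
    aabb->include(colliderStartPoints[vid.x]);
    aabb->include(colliderStartPoints[vid.y]);
    aabb->include(colliderStartPoints[vid.z]);
    aabb->include(colliderEndPoints[vid.x]);
    aabb->include(colliderEndPoints[vid.y]);
    aabb->include(colliderEndPoints[vid.z]);
}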

For simplicity in this thread, I will assume the triangles are
translating only. By subtracting this movement vector from the particle
path, we can perform a simple segment vs. triangle intersection.
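
Conceptually, the per-particle test boils down to something like this sketch (illustrative only; it just shows the relative-motion reduction, using the OptiX triangle helper):

static __device__ bool collideTranslatingTriangle(
    const float3& particleStart, const float3& particleEnd,
    const float3& v0, const float3& v1, const float3& v2, // triangle vertices at the start of dt
    const float3& triangleDelta,                           // rigid translation of the triangle over dt
    float& tHit, float& beta, float& gamma)                // normalized collision time and barycentrics
{
    // Subtract the triangle's motion so the triangle can be treated as static.
    const float3 relativeDir = (particleEnd - particleStart) - triangleDelta;
    // tmin = 0, tmax = 1: a hit with tHit in [0,1] is a collision within the time step.
    optix::Ray relativeRay = optix::make_Ray(particleStart, relativeDir, 0, 0.0f, 1.0f);
    float3 n; // non-normalized triangle normal, written by the helper
    return optix::intersect_triangle_branchless(relativeRay, v0, v1, v2, n, tHit, beta, gamma);
}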

The results so far seem OK, but the performance is below what I expected.

I also have a CPU-based collider, which performs closest point detection
and uses a binary tree to sort the triangles. It has several optimization
tricks that I don't know whether they are implemented in OptiX, but I believe
they are. For example, shortcuts between leaves and the next tree node…

I would like to know whether this result seems correct, or is there something
silly that I am missing here…?

The collider has about 10k triangles and 5k vertices, and we tested with
20k and 200k particles.
With 20k particles, the CPU collider runs at 3 fps; the OptiX collider at 1 fps on average.
With 200k, the fps is too small to measure well, but we did not notice any difference in
the performance ratio.
Maybe the scene is still too small? Even so, shouldn't I get at least slightly
better performance?

A few other things I am doing:

Sorting the particles to improve ray coherency.
Setting the “use_fast_math” option.
Setting the tracer to “refit” only.
Using Trbvh acceleration.
Disabling all exceptions and prints. (I have checked that there are no
exceptions being thrown.)
Compacting the positions into a float3 buffer. Changing everything to
float4 buffers is a little bothersome, but still I don't think it would
make much of a difference… or would it?

Is there anything else obvious I could try that I am missing here?

Thank you.

“The collider has about 10k triangles and 5k vertices, and we tested with 20k and 200k particles.
With 20k particles, the CPU collider runs at 3 fps; the OptiX collider at 1 fps on average. With 200k, the fps is too small to measure well, but we did not notice any difference in the performance ratio. Maybe the scene is still too small? Even so, shouldn't I get at least slightly better performance?”

Sorry, but performance numbers are meaningless without system configuration information. Please provide at least the following:
OS version, installed GPU(s), installed CPU(s), NVIDIA display driver version, OptiX version, and the CUDA toolkit version used to generate the PTX code.

How many rays are you actually shooting in that configuration with 10k triangles and 20k or 200k particles, respectively?

“The result buffer format is float4, containing the moment of collision, the triangle id, and the barycentric coordinates alpha and beta.”

Are you using your own custom triangle intersection algorithm?
Because the triangle intersection routines inside the OptiX examples return beta and gamma.

The intersection program is the most frequently called program, so its performance is crucial.

“Setting the “use_fast_math” option.”

You’re not using doubles, are you?

It’s always recommended to set “use_fast_math” for OptiX PTX code, because otherwise you’ll get the much slower but more precise calculations for all square root and trigonometric operations. You can see that inside the generated PTX code yourself: if there are no “approx” versions of the trigonometric functions, you’re paying the performance penalty.

My gut feeling is that 1 fps sounds much too low for something with 10k triangles tested against the trajectories of 20k particles, if that means you’re testing 20k ray segments against the moving triangle mesh, but I’m normally using the highest-end Quadro boards.

“the bounding box program considers both the start and end points of the vertices of each triangle, so the resulting AABB is much bigger than in the static case.”

That is definitely going to affect performance. If the AABBs are much bigger, then many more of them will be hit during traversal.

Experiments you can do to gather more data points:
What is the performance with static triangles and moving particles only?
What happens when you change the time step, and thereby the size of the AABB volumes and the ray segment lengths?

“Compacting the positions into a float3 buffer. Changing everything to float4 buffers is a little bothersome, but still I don't think it would make much of a difference… or would it?”

Loads and stores of float4 can be vectorized; float3 is handled as three individual floats. That means there can actually be a performance benefit to using float4 instead of float3. Also see the OptiX Programming Guide, Chapter 11, Performance Guidelines, on that topic.
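
As a rough illustration (not your code; the names here are made up), a float4 position element can be consumed like this:

// Sketch only: a float4 element is typically fetched with one vectorized load,
// while a float3 element is read as three separate scalar loads.
rtDeclareVariable(uint, launchId, rtLaunchIndex, );
rtBuffer<float4, 1> particlePositions4;   // xyz = position, w = padding

RT_PROGRAM void exampleRayGen()
{
    const float4 p   = particlePositions4[launchId];
    const float3 org = optix::make_float3(p.x, p.y, p.z);
    // ... build and trace the ray from org ...
}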

Related topic: [url]https://devtalk.nvidia.com/default/topic/997269/?comment=5098671[/url]

Windows 7 SP1, Intel Xeon E5-2620 v3 2.4 GHz (2 processors), Quadro K4200, driver 376.51

OptiX 4.1.1, CUDA 8.0

I shoot 1 ray per particle; there is no recursion.

I am using “intersect_triangle_branchless”. I just compute alpha from beta and gamma (alpha = 1 - beta - gamma) because alpha is the value handled by the solver running the collision.

In fact, I also have an implementation of the collider that considers particles with non-zero radius. That one uses a custom leaf/ray intersection and is much slower, but the radius = 0 case is too slow already. If I find the problem for this case, the r != 0 case should improve too.

I am using floats only. I checked the PTX for unwanted conversions.

The simulation behind the collision is heavy already, so the fps might not be very accurate. I will make some adjustments, measure the time of the collider alone in milliseconds, and then post it here.

I will inspect the bounding box construction; indeed, the boxes are overly big. The collider is running in a Maya plugin, which gets the mesh info once each frame. Inside a frame, the simulation runs a few substeps, but I was using the triangles at the start and end of the frame to compute the bounding boxes only once. At first I thought that refitting the BVH only once and running the collision a few times inside the frame would be faster, but I will try to decrease the bounding box size and refit the BVH each substep. I will post the results here soon.

The whole simulation uses vectors with xyz compacted. If I used float4 in the collider, I would have to copy the data with a stride instead of just passing the original array. If there is a big advantage in doing that, I can change the code, but I thought the main problem could be with the algorithm itself, the implementation of the programs, or some specific configuration that I was missing…

Anyway, I will make more precise measurements and post the results soon.

Thank you for your reply.

“which gets the mesh info once each frame. Inside a frame, the simulation runs a few substeps, but I was using the triangles at the start and end of the frame to compute the bounding boxes only once”

Is the 1 fps result the overall performance when running through all frames including the rebuild of the acceleration structures (AS) or is that just the performance of the actual ray tracing on a single frame?

How fast is the actual ray tracing on a single frame?
E.g. when just running the same intersection tests without the simulation a few hundred times on the same frame data with the inflated AABBs.
The AS build happens during the first launch. You can do a dummy launch with zero size to trigger that and then measure only the real ray tracing performance afterwards.
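
Something along these lines (a sketch using the handle names from this thread):

rtContextLaunch1D( mRTOptixContext, 0, 0 );             // zero-sized launch: compiles and builds the AS, traces nothing
// start the timer here
rtContextLaunch1D( mRTOptixContext, 0, numParticles );  // this launch now measures only the ray tracing
// stop the timer here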

If that ray tracing alone is fast, you’re limited by the AS build itself.
If the topology doesn’t change, refitting could help there.
Also, on triangle geometry, make sure that when you’re using Trbvh you set the acceleration properties properly to invoke the faster specialized BVH builder. See the OptiX Programming Guide, Chapter 3.5.3, Properties. More explanations here: [url]https://devtalk.nvidia.com/default/topic/1022634/?comment=5203884[/url]
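
For triangle meshes that setup looks roughly like this (property names per the Programming Guide; "accel" stands for your RTacceleration handle, and "vertices"/"indices" are placeholders that must match the names of your rtBuffer variables holding the triangle data):

rtAccelerationSetBuilder ( accel, "Trbvh" );
// Point the builder at the triangle data so it can use the specialized triangle build path.
rtAccelerationSetProperty( accel, "vertex_buffer_name",   "vertices" );
rtAccelerationSetProperty( accel, "vertex_buffer_stride", "12" );  // tightly packed float3
rtAccelerationSetProperty( accel, "index_buffer_name",    "indices" );
rtAccelerationSetProperty( accel, "index_buffer_stride",  "12" );  // tightly packed uint3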

If that ray tracing alone is slow already, you could be traversal, intersection, or shading limited.
Given that you’re using inflated AABBs, traversal performance would be my first bet.
That can be determined by testing the simulation on a static mesh (with smaller AABBs than over the whole frame, as tested above), e.g. taking only the triangles from the beginning of the frame.

It shouldn’t be a problem to use float3 for the input buffers. No need to manually copy that around.
It’s just that float4 output buffers can be a lot faster when using them in a multi-GPU environment.

Is any of that input or output data already in VRAM, or does everything go through host RAM?

A Quadro K4200 is a mid-range board from three GPU generations ago, and your display driver is 10 months old. You could try current drivers, but don’t expect a huge improvement by the factors you need.

Hi,

yes, this is the overall Maya FPS; animation, simulation… everything
included. Let me show it in more detail:

Simulation only: 3.5 FPS
Simulation + my CPU collider: 2.5 FPS
Simulation + Optix collider (inflated BB): 1 FPS
and now, after the changes…
Simulation + Optix collider (tight BB): 1.5 FPS

Thinking about it more, the CPU collider is not really comparable because the
algorithm is not the same and the result ends up being different… it
might be faster by accident, so I will ignore it from now on.

Before the changes, I computed inflated bounding boxes, refit the AS once per
frame and ran the collision each subframe.
After the changes, I compute tight bounding boxes, refit the AS and run the
collision test each subframe.

I am using a second entry point to interpolate between the collider's Start and
End shapes. Each substep, I interpolate them and mark the
acceleration structure as dirty. Then I call the collider.

The topology only changes with user input (when they change the
collider), so I am mainly setting the Trbvh with refit on, as described
in the link you gave.

I have measured the collision alone more precisely (I think) and will
post the results now.
The times vary wildly, so I will just give the averages.

** CPU->GPU transfer, GPU->CPU transfer, particle sorting and
interpolation are all very fast (less than 1 ms) and completely negligible
here. These times were taken from the 10th frame of the simulation.

Everything is in milliseconds.

======= inflated BBs ==========

moving collider:
[4 subframes] refit + traversal: avg 113 max 217 min 53
[40 subframes] refit + traversal: avg 68 max 180 min 29

standing collider:
[4 subframes] refit + traversal: avg 121 max 200 min 49
[40 subframes] refit + traversal: avg 62 max 151 min 24

======= tight BBs ==========

moving collider:
[4 subframes] refit + traversal: avg 180 max 187 min 168
[40 subframes] refit + traversal: avg 74 max 189 min 34

standing collider:
[4 subframes] refit + traversal: avg 177 max 186 min 171
[40 subframes] refit + traversal: avg 72 max 168 min 35

Tight is slower than inflated! This is unexpected; HOWEVER, the FPS
actually improved a lot.

Increasing the number of iterations per frame makes the rays shorter and, as
such, the tracer faster. That result is consistent.

A static collider has minimal bounding boxes compared to the moving one, but here
it does not get particularly faster. My guess is that the motion is
not large enough to make a difference, or that the number of triangles is too
small.

Nevertheless, for all cases, the new tightly computed bounding boxes are “slower”
than the inflated ones. My guess is that I am not measuring the
times properly…

After some googling, this is how I am doing it:

cudaEvent_t eStart, eStop;
cudaEventCreate(&eStart);
cudaEventCreate(&eStop);

cudaEventRecord(eStart, 0);
sortParticles(); //improve ray coherency here
cudaEventRecord(eStop, 0);
cudaEventSynchronize(eStop);
cudaEventElapsedTime(&sortTime, eStart, eStop);

cudaEventRecord(eStart, 0);
rtContextLaunch1D( mRTOptixContext, 1, numColliderVertices );
//interpolate the Start and End collider
cudaEventRecord(eStop, 0);
cudaEventSynchronize(eStop);
cudaEventElapsedTime(&interpolationTime, eStart, eStop);

rtAccelerationMarkDirty(mRTBVHAcceleration);

cudaEventRecord(eStart, 0);
rtContextLaunch1D( mRTOptixContext, 0, numParticles ); //collision
cudaEventRecord(eStop, 0);
cudaEventSynchronize(eStop);
cudaEventElapsedTime(&collisionTime, eStart, eStop);

If this is not the correct way (and is causing the noisy results), could you point
me in the right direction?

Also, I am using float4 buffers for the output, so I think there is no
problem there.
Everything goes through the host, but in a future version I am planning
to port the entire simulation to the GPU as well.

I really appreciate your help.
thank you.

Is the sortParticles() routine a CUDA kernel?
If not, I see no reason for any of the CUDA runtime code to do the benchmarking there.

I normally just measure performance with the High Frequency Counter under Windows and use this Timer class from our nvpro-pipeline for that:
https://github.com/nvpro-pipeline/pipeline/blob/master/dp/util/Timer.h
https://github.com/nvpro-pipeline/pipeline/blob/master/dp/util/src/Timer.cpp

All you need is a variable of that class and these two calls in your code then:

m_benchmarkTimer.restart();
// Do something you want to benchmark.
double seconds = m_benchmarkTimer.getTime();

I’m unsure what exactly you’re doing in the entry point with index 1 (“interpolate the Start and End collider”); especially the data flow in the buffers would be interesting.

Thanks for the tip! I will try it tomorrow, and if the results are different I will update the info here.

I am just avoiding excessive repetitions of the slerp. When computing the interpolations on the fly, I do it 6 times for each triangle in the bounding box routine, plus potentially (6 × many times) for each particle in the intersection routine. And the number of particles is much bigger than the number of vertices.
Now I interpolate each vertex only twice and reuse the result.
Tomorrow I will check whether I am being too paranoid about it; maybe computing on the fly is not so bad.
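
For reference, the interpolation entry point boils down to something like this sketch (simplified to a plain lerp; colliderFrameStart, colliderFrameEnd, substepT and interpolateCollider are just illustrative names):

rtDeclareVariable(uint,  launchId, rtLaunchIndex, );
rtDeclareVariable(float, substepT, , );      // interpolation parameter within the frame, in [0,1]
rtBuffer<float3, 1> colliderFrameStart;      // vertex positions at the start of the Maya frame
rtBuffer<float3, 1> colliderFrameEnd;        // vertex positions at the end of the Maya frame
rtBuffer<float3, 1> colliderStartPoints;     // positions read by the bounds/intersection programs

RT_PROGRAM void interpolateCollider()
{
    // One thread per collider vertex; the end-of-substep positions are handled the same way.
    colliderStartPoints[launchId] =
        optix::lerp(colliderFrameStart[launchId], colliderFrameEnd[launchId], substepT);
}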

OK, here are the updated results.

I am benchmarking as you suggested.
Also, I tested with a simple ray tracer to get an idea of the maximum performance I can achieve in the current form.

The interpolation takes about 1 ms, and the tree traversal gets a boost of only 1 or 2 ms… not really a big difference after all, but still better than nothing.

====== ray tracing =======

particles coherency on:
sort - 2ms
traversal - 35ms

coherency off:
traversal - 48ms

======== collision =======

coherency on:
sort - 2ms
traversal - 52ms

coherency off:
traversal - 82ms

Everything was measured on the first frame with 1 iteration only, to make sure the input is the same.

According to this, the limit would be around 35 ms.

Here is the code I am using to test the simple tracer:

rtBuffer<uint3, 1>  colliderVertices;
rtBuffer<float3, 1> colliderStartPoints;
rtDeclareVariable(unsigned int, backFaceCull, , "Flag to ignore collisions on the back face");
// Declarations implied by the rest of the plugin (the attribute name is whatever the closest hit program reads):
rtDeclareVariable(optix::Ray, ray, rtCurrentRay, );
rtDeclareVariable(float4, hitInfo, attribute hitInfo, );

RT_PROGRAM void intersectRay(int primId) {
    uint3 vid = colliderVertices[primId];
    float3 T[3];
    T[0] = colliderStartPoints[vid.x];
    T[1] = colliderStartPoints[vid.y];
    T[2] = colliderStartPoints[vid.z];
    float3 N = optix::cross(T[1] - T[0], T[2] - T[0]);
    float Ndot = optix::dot(N, ray.direction);
    if (!backFaceCull || (Ndot < 0.0f)) {
        float t, beta, gamma;
        if (optix::intersect_triangle_branchless(ray, T[0], T[1], T[2], N, t, beta, gamma)) {
            if (rtPotentialIntersection(t)) {
                hitInfo = optix::make_float4((float)primId, beta, gamma, t);
                rtReportIntersection(0);
            }
        }
    }
}

RT_PROGRAM void bounds (int primId, float result[6]) {
    uint3 vid = colliderVertices[primId];

    optix::Aabb* aabb = (optix::Aabb*)result;

    aabb->m_min = colliderStartPoints[vid.x];
    aabb->m_max = aabb->m_min;
    aabb->include(colliderStartPoints[vid.y]);
    aabb->include(colliderStartPoints[vid.z]);
}

Can it get better than that?
The scene has precisely:
28944 particles
10312 triangles with 5192 vertices

That looks pretty optimal.
There is no need to calculate the normal vector N and Ndot before the intersection if backFaceCull is false.
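intersect_triangle_branchless writes the (non-normalized) face normal into its output parameter anyway, so the explicit cross product can go away entirely. A sketch of the reordering, reusing the declarations from your snippet:

RT_PROGRAM void intersectRay(int primId)
{
    const uint3  vid = colliderVertices[primId];
    const float3 v0  = colliderStartPoints[vid.x];
    const float3 v1  = colliderStartPoints[vid.y];
    const float3 v2  = colliderStartPoints[vid.z];

    float3 N;              // written by the intersection helper
    float  t, beta, gamma;
    if (optix::intersect_triangle_branchless(ray, v0, v1, v2, N, t, beta, gamma))
    {
        if (!backFaceCull || optix::dot(N, ray.direction) < 0.0f)   // cull only when requested
        {
            if (rtPotentialIntersection(t))
            {
                hitInfo = optix::make_float4((float)primId, beta, gamma, t);
                rtReportIntersection(0);
            }
        }
    }
}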

If everything else about the Trbvh acceleration structures is set up as I recommended, so that the AS builds are optimal, there is not much more I can help with given the available information.

If the simulation runs on the host and the results are copied back and forth, that will have an impact which is not present in the CPU version, because there the memory is on the host anyway; but that small number of results shouldn’t saturate the PCI-E bus.

If the simulation alone only runs at 3.5 fps anyway, bringing it onto the GPU with CUDA kernels and using CUDA interop for the data exchange with OptiX could have more potential for improvement.
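
A rough sketch of that kind of data flow (rtBufferGetDevicePointer as in the OptiX 4.x C API; mRTParticlePositionsBuffer, simulationStep and dt are placeholder names for your buffer, kernel and time step):

void* d_positions = 0;
rtBufferGetDevicePointer( mRTParticlePositionsBuffer, 0 /* device ordinal */, &d_positions );

// Let the simulation kernel update the particle positions directly in the OptiX buffer,
// so the data never leaves VRAM.
const unsigned int block = 256;
const unsigned int grid  = (numParticles + block - 1) / block;
simulationStep<<<grid, block>>>( static_cast<float3*>(d_positions), numParticles, dt );
cudaDeviceSynchronize();

rtContextLaunch1D( mRTOptixContext, 0, numParticles );  // collision query on the updated data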

Yes, the simulation, the animation, the visualization… everything to the GPU!
My main concern is memory; depending on the scene, it might take a few GBs…
But before that, I have to figure out what the problem is here.
No matter how I measure it, the tree traversal alone is taking a big chunk of the performance (3× more than the simulation). I would like to be able to visualize what the BVH looks like. (Also, I would like to measure the AS rebuild/refit performance as well.)

I have found some strange behavior: when I first open the scene, before running the collision for the first time, the AS is built with all the vertices at the origin. I think this might be the reason for the low performance. I build the AS only once and turn refit “on” until there is a change in the topology. If the tree is mistakenly built with all vertices at zero, the BVH itself would be of low quality.

Next Monday, I will try to figure out what is happening and post an update here.

Thanks again for your help!

Hello again,

I have fixed a minor issue unrelated to OptiX, and now the weird behavior is gone.

Nevertheless, the performance remains unchanged. I have run out of ideas; would you mind confirming that the setup code is correct? I will keep it concise. I think every aspect that could be affecting the performance is covered.

void create(){

    rtContextCreate( &mRTOptixContext );
    rtContextSetExceptionEnabled( mRTOptixContext, RT_EXCEPTION_ALL, 0 );
    rtContextSetPrintEnabled( mRTOptixContext, 0 );
    rtContextSetRayTypeCount( mRTOptixContext, 1 );
    rtContextSetEntryPointCount( mRTOptixContext, 1 );

    rtProgramCreateFromPTXFile( mRTOptixContext, pathToRayGenProgram.c_str(), "RayGen", &rayGenProgram );
    rtContextSetRayGenerationProgram( mRTOptixContext, 0, rayGenProgram );

    rtProgramCreateFromPTXFile( mRTOptixContext, pathToRayGenProgram.c_str(), "particle_miss", &particleMissProgram );
    rtContextSetMissProgram( mRTOptixContext, 0, particleMissProgram );

    //declare buffers and variables
    //..
    //..
    //end declare buffers and variables

    rtGeometryCreate( mRTOptixContext, &mRTDynamicMesh );
    rtProgramCreateFromPTXFile( mRTOptixContext, pathToGeometryPTX.c_str(), "bounds", &mRTBBProgram );
    rtGeometrySetBoundingBoxProgram( mRTDynamicMesh, mRTBBProgram );
    rtProgramCreateFromPTXFile( mRTOptixContext, pathToGeometryPTX.c_str(), "intersectRay", &mRTintersectRayProgram );
    rtGeometrySetIntersectionProgram( mRTDynamicMesh, mRTintersectRayProgram );

    rtProgramCreateFromPTXFile( mRTOptixContext, pathToMaterialPTX.c_str(), "closest_particleHit", &closestHitProgram );
    rtMaterialCreate( mRTOptixContext, &mRTHitInfoMaterial );
    rtMaterialSetClosestHitProgram( mRTHitInfoMaterial, 0, closestHitProgram );

    rtGeometryInstanceCreate( mRTOptixContext, &instance );
    rtGeometryInstanceSetGeometry( instance, mRTDynamicMesh );
    rtGeometryInstanceSetMaterialCount( instance, 1 );
    rtGeometryInstanceSetMaterial( instance, 0, mRTHitInfoMaterial );

    rtAccelerationCreate( mRTOptixContext, &mRTBVHAcceleration );
    rtAccelerationSetBuilder( mRTBVHAcceleration, "Trbvh" );
    rtAccelerationSetProperty( mRTBVHAcceleration, "chunk_size", "0" );
    rtAccelerationSetProperty( mRTBVHAcceleration, "refit", "0" );

    rtGeometryGroupCreate( mRTOptixContext, &geometrygroup );
    rtGeometryGroupSetChildCount( geometrygroup, 1 );
    rtGeometryGroupSetChild( geometrygroup, 0, instance );
    rtGeometryGroupSetAcceleration( geometrygroup, mRTBVHAcceleration );

    rtContextDeclareVariable( mRTOptixContext, "top_object", &top_object );
    rtVariableSetObject( top_object, geometrygroup );
}

void collide(){
    if (mbValidateContext) {
        rtContextValidate( mRTOptixContext );
        rtAccelerationSetProperty( mRTBVHAcceleration, "refit", "0" );
        mbValidateContext = false;
    }
    if (mbColliderDirty){
        rtAccelerationMarkDirty( mRTBVHAcceleration);
        mbColliderDirty = false;
    }

    //(...)

    rtContextLaunch1D( mRTOptixContext, 0, muiNumRays );
    rtAccelerationSetProperty( mRTBVHAcceleration, "refit", "1" );
}

//////

RT_PROGRAM void RayGen() {
    unsigned int pid = particleRemapping[launchId];

    float4 hitResult = optix::make_float4(-1.0f, 0.0f, 0.0f, 1.0f); // triangleId, beta, gamma, hitTime
    float3 org = particlesPositions[pid];
    float3 dir = particlesDestination[pid] - org;

    optix::Ray ray = optix::make_Ray(org, dir, 0, -originOffset, 1.0f);
    rtTrace(top_object, ray, hitResult);

    collisionResult[pid] = hitResult;
}

Also, at the end there is the ray generation program. I am not using a normalized ray.direction, but I guess this is not a problem, right?

If there is nothing wrong, I will assume that this performance is the limit I can get with this machine.

Thanks, I appreciate your help.

Here’s what I would change:

“I am not using a normalized ray.direction, but I guess this is not a problem, right?”
The OptiX documentation doesn’t mention this, but the low-level OptiX Prime API requires normalized ray directions for correct results, and OptiX shares the AS builder and traversal code with it, so it’s a good idea to send normalized ray directions from the ray generation program. All of the OptiX SDK examples do that.
It’s potentially more precise to use normalized directions as well, because the transformation into local space requires inversions.
The direction field of the variable with rtCurrentRay semantic is always normalized inside the different program domains which use different coordinate spaces.
That might only be the case when actually sending normalized directions to begin with.
I’ve never used unnormalized ray directions.
Please try if that changes anything:
float3 dir = optix::normalize(particlesDestination[pid] - org);

rtContextValidate( mRTOptixContext );
Never call that inside a performance critical path.
That should either only be called once after the whole scene setup is done or for debugging only.
Looks like you’re only doing that one time.

rtAccelerationSetProperty( mRTBVHAcceleration, “chunk_size”, “0” );
There is no need to call that with value 0; you can simply leave it out. I assume this is debug code.

rtAccelerationSetProperty( mRTBVHAcceleration, “refit”, “0” );
You’re setting this twice. It shouldn’t be set at all if you want to refit later anyway.

rtAccelerationSetProperty( mRTBVHAcceleration, “refit”, “1” );
Do not set this per launch. Set it only once and before any launch.
It’s a state which gets handled in the next launch if the AS is dirty and has been built before.
That means if you’re constantly updating the geometry, the refit property should be persistent and set only once.
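
In other words, something like this sketch (using the handle names from your code):

// Once, at setup time:
rtAccelerationSetBuilder ( mRTBVHAcceleration, "Trbvh" );
rtAccelerationSetProperty( mRTBVHAcceleration, "refit", "1" );   // persistent; the first build is still a full build

// Per substep, after updating the vertex positions:
rtAccelerationMarkDirty( mRTBVHAcceleration );                   // the next launch refits instead of rebuilding
rtContextLaunch1D( mRTOptixContext, 0, muiNumRays );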

optix::Ray ray = optix::make_Ray(org, dir, 0, -originOffset, 1.0f);
Hmm, I’ve never used a negative t_min. In principle the interval tests shouldn’t care, but I don’t know if there is any place inside OptiX which expects positive values there.
Maybe check if there are any invalid ray exceptions.

You can try using normalized directions, shifting the origin to get t_min >= 0.0f, and adjusting t_max to the length the direction vector had before normalization.
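
A sketch of that ray setup, reusing your variable names:

const float3 delta = particlesDestination[pid] - particlesPositions[pid];
const float  len   = optix::length(delta);
const float3 dir   = delta / fmaxf(len, 1.0e-8f);                  // normalized; guard against zero-length paths
const float3 org   = particlesPositions[pid] - dir * originOffset; // shift the origin so t_min can stay at 0

optix::Ray ray = optix::make_Ray(org, dir, 0, 0.0f, len + originOffset); // t_max = effective ray length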

Other than that, 20k rays aren’t a high enough workload to saturate current GPUs.
I’m guessing that the AS build and the traversal over the inflated AABBs are the limiting factors.
To be able to measure that I would need a reproducer project or an OptiX API Capture (OAC) trace of a few frames.
This thread contains a description how to produce one: [url]https://devtalk.nvidia.com/default/topic/803116/?comment=4436953[/url]
That trace would also contain exactly the OptiX calls you’re doing for an easier analysis. You can use that yourself as well.

Bingo!

I had to reimplement the programs to handle normalized directions; everything is running at 5 ms now.

The scene scale is about 2 meters, with particles moving less than a few millimeters each frame. I was setting t_min and t_max to 0.0 and 1.0, so I had pretty much huge rays hitting all the boxes.

But still, I would like to be able to work with negative ranges and unnormalized rays… the code is much less readable now.

The particle sorting (on the CPU) is barely having any effect, so I guess I will omit it from now on.

As for the other suggestions, I will apply them here. Thank you.

Awesome! Now that’s more like what I would have expected. :-)

Unfortunately the most efficient code is not always the most readable one. The majority of algorithms require normalized directions anyway and it’s unlikely that this is going to change in OptiX.