OptiX bad performance with dynamic objects

System:
OS: OpenSuse 42.3
GPU: Quadro M5000M
Driver: 418.30 / 430.14
RAM: 32G
CPU: Intel(R) Core™ i7-6920HQ CPU @ 2.90GHz
OptiX: 5.0.1
CUDA: 9.0.176.4

Hi!
Our company uses OptiX for the real-time simulation of LIDAR/RADAR sensors. We have a complex simulation with vehicles, pedestrians and huge cities (5+ km²).

If we run OptiX on a huge database with only static geometry, everything runs smoothly at ~80 FPS. The problem arises when dynamic objects enter the scene. With as few as 10 vehicles, the performance drops to 10-15 FPS. Each vehicle is under a transform node, and we only update the matrix in this node every frame. When increasing the usage report level, I found that the acceleration update takes 30-40 ms after adding dynamic objects, and the load on one CPU core goes to 100%. Profiling showed that more than 50% of the CPU load is caused by OptiX. The acceleration structure is Trbvh. I also tried setting acceleration->setProperty("refit", "1") on all vehicles and the static scene (as mentioned in some posts in this forum), but it actually made the problem even worse and the performance dropped to ~5 FPS. So basically this looks like a CPU bottleneck, but I cannot distribute the load to multiple cores. Adding more threads on our side won't help either, since OptiX is bound to a single CPU core (which should really be enough).

Some important facts:

  • The vehicles are high poly (10-20k triangles each)
  • The hierarchy depth of the vehicle nodes is around 3-5, since we separate wheels, doors, windows etc. I know we should flatten this to one node, but I'm not sure this is the issue.

Can you please suggest some optimisations we could try? I find it hard to believe that 10 dynamic objects cause such a performance drop, considering all the PC games using ray tracing with many moving players without a problem.

Thanks and all the best,
Jakub

Hi Jakub,

This is a known problem in OptiX that is handled differently than in games using DXR & Vulkan ray tracing. The problem is with how we manage updates to the scene graph and device memory, which is a feature that doesn't exist in those other APIs. We were trying to make things easier for the user by including memory management, but some workflows have performance problems. OptiX has historically been better at batch ray tracing than at real-time ray tracing of dynamic scenes. We're in the process of permanently fixing that.

First off, we're sorry about it. This "bug" currently exists in OptiX 6 as well, and a partial fix is being released in a driver update soon for OptiX 6. We also have a larger structural update in the works that fully fixes this problem. But given that you're on OptiX 5, I'm sure you'd prefer a workaround that doesn't force you to upgrade or wait.

This indeed can't be solved with multiple GPUs, and it's not a problem with the acceleration structure, or with your vehicle poly counts or hierarchy depth. The issue is that certain kinds of updates to the scene information can trigger large buffer copies to the GPU, so what you're seeing in the CPU load is host-to-device memory transfer. When some transforms are marked for update in GPU memory, all the transforms in the transform buffer are sent to the GPU, so it's a problem if you have a large number of transforms. This is also an issue with context variables in large dynamic scenes like yours.

So here are a few ways you might be able to reorganize and fix the problem:

Remove any per-frame context variable updates and put any dynamic data in your own rtBuffer instead. That way you control the frequency and size of variable updates.
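For example, here is a minimal sketch of that idea (the buffer name and layout are just illustrations, not part of your code):

// Device side (CUDA): read per-frame data from a small buffer instead of context variables.
rtBuffer<float4, 1> per_frame_data;   // e.g. element 0 = sensor position, element 1 = time info

// Host side (C++): update only this small buffer once per frame.
void updatePerFrameData(optix::Buffer perFrameData, const optix::float4& sensorPos, const optix::float4& timeInfo)
{
    optix::float4* p = static_cast<optix::float4*>(perFrameData->map());
    p[0] = sensorPos;
    p[1] = timeInfo;
    perFrameData->unmap();
}

That keeps the per-frame upload down to one small, explicit buffer copy that you control.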

This might not work, but you could try putting all your dynamic (moving) objects in one group together and putting the static objects in a separate group. There's only a minor chance this will resolve the issue, but there is a chance, and it's probably much easier to try than my next suggestions.
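A rough sketch of that grouping (the node names are hypothetical, and whether it helps in your case would need to be measured):

// Root group with one child Group for static and one for dynamic geometry, each with its own
// acceleration, so per-frame dirtying stays in the dynamic subtree (plus the root).
optix::Group root = context->createGroup();
root->setAcceleration(context->createAcceleration("Trbvh", "Bvh"));

optix::Group staticGroup = context->createGroup();
staticGroup->setAcceleration(context->createAcceleration("Trbvh", "Bvh"));
// ... add all static GeometryGroups / Transforms as children of staticGroup ...

optix::Group dynamicGroup = context->createGroup();
dynamicGroup->setAcceleration(context->createAcceleration("Trbvh", "Bvh"));
// ... add the vehicle Transform nodes as children of dynamicGroup ...

root->addChild(staticGroup);
root->addChild(dynamicGroup);

// Per frame: update the vehicle matrices, then mark only the dynamic and root accels dirty.
dynamicGroup->getAcceleration()->markDirty();
root->getAcceleration()->markDirty();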

One way that is guaranteed to work is to create two OptiX contexts, one for static geometry and one for dynamic geometry. This would mean two separate renders, duplicating some setup, and adding some kind of combination or compositing phase at the end to get a single image out. I know this is a fairly heavy and blunt solution, but this way the small set of dynamic updates won’t drag you down by also updating your large set of static transforms.

Another option might be to roll your own dynamic instancing. By that I mean keeping your own small buffer of transforms for moving objects and using a pair of custom bounds & intersection programs that reference those transforms, rather than using the OptiX scene graph. I haven't thought this through carefully, so it might be more work than using two contexts, but it seems like it could work. You would need a way to identify instance IDs, which you might do with a variable on each GeometryInstance. You would need the forward instance transforms in your bounds program, and the inverse instance transforms in your intersection program.
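A very rough sketch of what those programs could look like (CUDA device code; the buffer, the instance_id variable, and the idea of intersecting a canonical primitive in object space are all assumptions here, not a tested recipe):

#include <optix_world.h>
#include <optixu/optixu_matrix_namespace.h>
using namespace optix;

rtBuffer<Matrix4x4, 1> instance_transforms;   // user-managed forward (object-to-world) matrices
rtDeclareVariable(int, instance_id, , );      // set on each GeometryInstance from the host
rtDeclareVariable(optix::Ray, ray, rtCurrentRay, );

RT_PROGRAM void bounds(int /*primIdx*/, float result[6])
{
    // Forward transform: take the object-space AABB corners into world space.
    const Matrix4x4& M = instance_transforms[instance_id];
    Aabb* aabb = reinterpret_cast<Aabb*>(result);
    aabb->invalidate();
    for (int i = 0; i < 8; ++i)
    {
        const float3 corner = make_float3((i & 1) ? 1.0f : -1.0f,
                                          (i & 2) ? 1.0f : -1.0f,
                                          (i & 4) ? 1.0f : -1.0f);   // unit-cube corners as a stand-in
        aabb->include(make_float3(M * make_float4(corner, 1.0f)));
    }
}

RT_PROGRAM void intersect(int primIdx)
{
    // Inverse transform: take the ray into object space and intersect the canonical geometry there.
    // In practice you would upload a precomputed inverse per instance rather than inverting per ray.
    const Matrix4x4 inv = instance_transforms[instance_id].inverse();
    const float3 o = make_float3(inv * make_float4(ray.origin, 1.0f));
    const float3 d = make_float3(inv * make_float4(ray.direction, 0.0f));
    // ... intersect the untransformed primitive with (o, d) and report the hit as usual ...
}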

I hope that helps, and maybe knowing exactly what the issue is will give you some other ideas. Let me know if any of those suggestions don’t make sense, or if they need more explanation.


David.

Hello Jakub,

what exactly do you mean by “The hierarchy depth of the vehicle nodes is around 3-5 since we separate wheels, doors, windows etc”?

Do you build a two-level acceleration structure hierarchy? E.g. one separate accel for the wheels, doors etc, and then a top accel over all of that? Or is your accel hierarchy deeper? This could lead to performance issues.

I assume that for most applications with dynamic objects a two-level accel hierarchy is the best solution. The top-level tree should be built over the static as well as all the dynamic objects/parts.

Heiko

Hi Heiko,

I’m quite certain the deep hierarchy is not the cause of the performance problems in Jakub’s case. Your point is interesting, though, so I think it’s worth taking a minute to discuss the performance costs of instancing.

Unlike the current versions of DXR & Vulkan which only allow 2-level hierarchies, OptiX supports a scene graph that can be up to 16 levels deep, because it’s important for high scene complexity.

Using fewer levels is indeed better for traversal performance, but using more levels is also good for reduced memory consumption and reduced startup times. And sometimes needing more than two levels is unavoidable to achieve scene complexities that exceed what games can usually do.

You’re right that there is a traversal cost for extra levels, but the cost might not be as high as you think. Mainly, it’s a couple of matrix-vector multiplies per level.

I measured a test scene with an octree of teapots in a grid. I built the scene hierarchy in two ways: once using deep instancing, and once flattening the scene graph down to 2 levels. On my Maxwell laptop, a 3-level scene of 64 teapots is 8.8% slower than the flattened 2-level scene. When I crank it up to a 6-level scene of 32,768 teapots, the deep hierarchy measures 40% slower to trace than the same scene flattened to a 2-level hierarchy. Note that 40% is approximately 8.8% compounded over the 4 extra levels (1.088^4 ≈ 1.40). The pattern in this scene continues predictably from 3 to 7 levels deep, costing just under 9% per additional level of instancing.

Please note that I covered most of the screen with deep instances. A more typical scene might have only a small fraction of the scene covered by the deepest instances, so you would only pay a small fraction of the performance cost for using deep instances. I was also careful with the deep hierarchy to make sure I didn’t have a lot of bounding box overlap. Performance will suffer if your groups of instances have a significant amount of overlap, which is true regardless of how many levels of instancing you use. I used some very simple Phong shading, and no secondary rays. With more complex shading, the relative cost of a deep hierarchy will be even smaller.

Now for the fun part. I tested the same scene all the way up to 15 levels deep. The deep hierarchy’s memory usage stayed almost constant at a few megabytes, and the startup time stayed at about one tenth of a second (including reading the teapot model and building the acceleration structures). The two level scenes, on the other hand, see geometric memory growth and geometric startup times. At 8-levels deep (flattened to 2), the memory usage is 3 gigabytes and the startup time is 1.5 minutes. I stopped testing the flattened scene at 8 levels deep because it was taking too long (and I can see that 9 levels flattened will fit in memory, but 10 levels flattened will run out). But it was easy to keep increasing the levels of deep hierarchy. At 15 levels deep, the test scene has 7.12 * 10^16 “virtual” polygons (71 quadrillion triangles coming from 4.4 * 10^12 “virtual” teapots, where each teapot has ~16k triangles), and the startup time is still ~0.1 seconds with a low but still interactive frame rate!

Also keep in mind that even before you run out of memory with flattened 2-level traversal, high memory usage can impact framerate. At 8 levels deep in my test, the flattened 2-level traversal with ~2M transforms suddenly becomes much slower than the deep 8-level traversal. Once traversal is using that much memory, poor caching and high memory bandwidth of 2-level traversal may dominate.

This is just one anecdotal data point with a contrived scene and a lot of variables; changing any of them can affect the performance profile, so I'm not suggesting that everyone will see the 8.8% cost that I did. I am saying that if you need deep instancing, don't be too afraid to try it. There are some benefits, and we support it for good reasons. If the costs are a concern, then measure all three (trace times, memory usage and startup times) in order to make an informed choice.


David.

Hi David, Hi Heiko,
first of all, thank you very much for your help with this issue!

I'm currently updating our OptiX plugin to support OptiX 6 and RTX, so I have the possibility to change a lot on our side, and all your suggestions might flow into my rework.

@ Remove any per-frame context variable updates and put any dynamic data in your own rtBuffer instead. That way you control the frequency and size of variable updates.

  • This is really valuable information. We have some variables that are updated every frame, and putting them into a separate buffer makes sense to simplify the work OptiX has to do.

@ you could try putting all your dynamic (moving) objects in one group together and putting the static objects in a separate group.

  • I can try this out next week, could help ;)

@ create two OptiX contexts:

  • Well, this is an interesting approach. I think it would be hard to combine the results of separate static and dynamic scene ray tracing, and it would end up being just an approximation, but it gave me another idea. Since memory is not an issue right now (a P6000 is an option), I might just start 2 instances of the ray tracing plugin and let them work on alternating frames in parallel (frame % 2). Of course, this is my last resort since it really wastes GPU memory, which we will need at some point.

@ Or is your accel hierarchy deeper?
The hierarchy is a direct copy of the OpenGL hierarchy we have right now (unfortunately), which is ~5 nodes deep. Since I'm in the process of rewriting the code to support OptiX 6, I will flatten the nodes into 1 level, since there is no actual benefit from the separation; our OpenGL scene graph also has problems with it…

@ ~0.1 seconds with a low but still interactive frame rate!

  • Thank you so much for the comparison tests, I was wondering how much the hierarchy affects OptiX and this is a perfect example I can work with!

Altogether, you gave me a great insight into some of the OptiX features. Could you give me a rough estimation on the release of the OptiX 6 + Driver update you mentioned? Also, the structural update timeline would be interesting :)

What would also be of interest to me is why the “refit” option made the problem even worse. According to the documentation, it should help with dynamic objects instead of decreasing framerates :). Any idea why this is the case?

Could you give me a rough estimation on the release of the OptiX 6 + Driver update you mentioned?

The fix was just submitted; it now needs to go through testing, packaging, and release with the rest of the driver, so I expect it'll be maybe a month if nothing goes wrong. We're planning to post some release notes to the forum, and I'll make a note to reply to this thread when this particular update is released.

Also, the structural update timeline would be interesting :)

I wish I could comment on that, but I can’t. We’re working hard on it!

What would also be of interest to me is why the “refit” option made the problem even worse. According to the documentation, it should help with dynamic objects instead of decreasing framerates :). Any idea why this is the case?

The trade-off with refit is that you get a faster rebuild time in exchange for a potentially slower trace time, so it has to be measured and balanced. Refitting a large acceleration structure is around 10x faster than building it from scratch.

The trace time of a refit acceleration structure depends on how far things have moved: the farther things move, the slower it may get. It's not guaranteed to slow down, but it's likely. The reason is that refit adjusts the positions of all the boxes but does not recompute the hierarchy or re-evaluate the way things are split up. So when instances start moving around, especially in different directions, the interior boxes start to overlap a lot. This is directly related to what I mentioned above: "performance will suffer if your groups of instances have a significant amount of overlap".

The goal with refit is to allow a lower-cost update when things only move a small amount. It can help if rebuilding all of your acceleration structures in a single frame takes too much time. A typical use of refit in a dynamic environment might be to refit 90% of your acceleration structures every frame and rebuild 10% of them, doing this in round-robin fashion so that every acceleration structure gets rebuilt at least once every 10 frames. Obviously, you can adjust the percentages and the number of frames to suit your particular use case.
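In host code the round-robin scheduling could look roughly like this sketch (names are mine; it assumes one Acceleration per dynamic object and that toggling the "refit" property before markDirty() switches between a refit and a full rebuild, which you should verify on your OptiX version):

void updateDynamicAccels(std::vector<optix::Acceleration>& accels, unsigned int frameIndex)
{
    const unsigned int period = 10;   // each accel gets a full rebuild once every 10 frames
    for (size_t i = 0; i < accels.size(); ++i)
    {
        const bool fullRebuild = ((frameIndex + i) % period) == 0;
        accels[i]->setProperty("refit", fullRebuild ? "0" : "1");
        accels[i]->markDirty();
    }
}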


David.

Hey Jakub_D_Kolesik,

I was wrong about this taking a month; it appears the fix to track partial updates to the scene's list of transforms made it through QA quickly and was released in the latest driver, 431.02.

Currently we track a maximum of 4 memory blocks to update each frame. This means you should be able to move 4 objects dynamically without worrying about large buffer copies. But consecutive memory blocks of transforms are joined into a single update. So since you're moving more than 4 objects, I would recommend putting your moving objects into the scene together, meaning add all the moving objects to the scene back to back so they get consecutive IDs. Then each frame, set the transforms for all of your dynamic objects, even the ones that don't move, to make sure there aren't any holes, so the memory block for transforms is updated as one single block copy.
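In practice the per-frame update could look roughly like this (a sketch under the assumptions above; the container names are mine, and the Transform nodes are assumed to have been created back to back):

// Set every dynamic Transform each frame, even if it didn't move, so the dirty transforms
// form one contiguous block and get uploaded as a single copy.
void updateDynamicTransforms(std::vector<optix::Transform>& dynamicTransforms,
                             const std::vector<optix::Matrix4x4>& currentMatrices)
{
    for (size_t i = 0; i < dynamicTransforms.size(); ++i)
    {
        const optix::Matrix4x4 m   = currentMatrices[i];
        const optix::Matrix4x4 inv = m.inverse();
        dynamicTransforms[i]->setMatrix(false, m.getData(), inv.getData());
    }
}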

That should help a lot with your frame rate, as long as your set of dynamic objects is small relative to your set of static objects. Let me know if this doesn’t improve the situation.


David.

Oh no, I goofed! This fix was listed as being released, but it didn't actually make it over to the release branch. I apologize if you started working on this on my advice!! It should be out in the next driver, in a couple of weeks.


David.

Thanks David, a couple of weeks is ok.
I'm looking forward to the fix ;). Is there any information about the maximum of 4 being raised to 10+? I know of some companies expecting 50-100 vehicles on the road. I don't expect such a sudden boost, but it would be interesting to know if it's on the roadmap :).

The max of 4 is just a temporary workaround, and I mentioned the coalescing because once you know how it works, you can use that information to cheat and get more than 4 dynamic objects. That's why it would be good to group all dynamic objects together and see if updating all of them is better than updating ~10 individual objects.

We made the number intentionally very small for now to guarantee that there weren't going to be any unintended performance consequences of tracking partial memory updates. The real problem is that we're trying to track your updates in the first place and getting in your way. If we started tracking hundreds of them, we would certainly add a new performance pit somewhere else. So rather than play whack-a-mole, we decided on a backwards-compatible quick fix that should address the majority of use cases, and later a long-term breaking change where you can manage your own updates. A lot of people have run into the same problem you did, and the most common use case is needing to update just 1 transform in interactive scenarios.

Our larger architectural change allows any number of dynamic objects; it doesn't have any hard-coded limit. We will provide more information on that when it's ready, just don't count on it any time soon.


David.

Hi David,
Thank you very much for your support, you gave me a lot of valuable information. :)

@ Large architectural rework
Sure, no problem. I just hope that a multi-threading approach will be part of the rework, since the CPU will always become a problem at some point if we're clamped down to just 1 core & thread :).

Thanks a lot,
Jakub

I just hope that a multi-threading approach will be part of the rework

Don’t you worry! ;)


David.

Hi dhart,

I have a problem that I think could be related to this topic.
I'm trying to render a set of spheres using OptiX.
I'm using CUDA 7.0 and OptiX 3.9 (I can't upgrade them since another algorithm depends on those versions).
GPU: Quadro M6000 24GB.

The rendering consists of placing some spheres in front of the camera and moving them to a different position each frame using a uniform distribution (std::uniform_real_distribution) for the three axes (dynamic sphere positions).

At the beginning of the simulation everything works ok (I will attach some images), but some frames later the spheres begin to appear in very non-uniform positions (I'm using a volumetric uniform distribution, which looks great for the first frames). The issue is more evident when the spheres are far away from the sensor (I was thinking that the distance was the only cause, due to the sphere size and the number of rays emitted), but I have since moved the positions so the spheres are very close (I'm using a range from 0.01 to 0.5 meters).

I'm using a GeometryGroup containing from 50 to 500 spheres. I use an rtBuffer that contains the position (x, y, z) and the radius of the elements (float4 data type). For the intersection and bounding box programs I used the ones from the SDK (I think both work ok). I have two acceleration structures in the hierarchy (one for the GeometryGroup and the other for the root Group); for both I have tried Bvh and Trbvh with the same result. I also call the markDirty() method on both acceleration nodes after the buffer is filled.

I don't understand why everything works perfectly at the beginning and then the intersection seems to stop working correctly (it is as if the spheres on one half of the scene are not detected), which is why this post is interesting for me. Could you give me some clues about this, please? I'd really appreciate your help.




As you can see, in the images labeled as "not working" the distribution is not uniform, and the spheres on the right side don't have well-defined borders. It is evident that something stopped working properly (see the "working properly" images). Although I didn't measure the GPU performance, I can tell the computer starts to slow down (it is consuming a lot of resources).

The non-working version looks as if the acceleration structure wasn't properly rebuilt.
That some spheres start to look box-shaped could be because a formerly small bounding box now contains a bigger sphere.

In principle this should not happen as can be seen in these threads for example:
Particle systems using spheres:
[url]https://devtalk.nvidia.com/default/topic/1026659/optix/interactive-geometry-replacement/[/url]
[url]https://devtalk.nvidia.com/default/topic/1027203/optix/refit-of-spheres/post/5224566[/url]
Slightly related topics:
[url]https://devtalk.nvidia.com/default/topic/1025339/optix/optix-based-collider-performance/post/5215222[/url]
[url]https://devtalk.nvidia.com/default/topic/899207/optix/graph-organization-of-thousands-of-independent-dynamic-objects/post/4738495[/url]

That your system starts to slow down and consume a lot of resources might indicate a memory leak.
Note that the OptiX C++ wrappers do not do OptiX object lifetime management. You must call destroy() explicitly when removing objects from the scene, which in your case should not be required.
All you need to do is update the buffer data with the sphere attribute float4 values and call markDirty() on the proper acceleration structures, which are the one on the GeometryGroup containing that Geometry and all acceleration structures in the scene graph above that GeometryGroup. None of the existing OptiX objects needs to be removed, recreated, or re-added in that case.

It’s not possible to say what is going on without source code.

Exactly, it is like the rebuild is not working properly.

This is the code that calls the buffers update function each frame:

if (m_setrain)
{
    SetSpherePos(current_pos);
    m_geometrygroup->getAcceleration()->markDirty();
    m_group->getAcceleration()->markDirty();
}

As you can see I use the markDirty() for geometrygroup and the root group.

And this is the cycle that updates buffers in SetSpherePos() function:

float* pos = reinterpret_cast<float*>(m_buffers.position->map()); // This algorithm creates a volumetric uniform distribution

for (uint32_t i = 0, idx = 0, clean_aux = 0; i < m_spheresnum; i++)
{
    pos[clean_aux++] = 0.0f;
    pos[clean_aux++] = 0.0f;
    pos[clean_aux++] = 0.0f;
    pos[clean_aux++] = 0.0f;

    theta = 2.0 * M_PI * space_dist(mt);
    phi   = acos(1.0 - 2.0 * space_dist(mt));
    r     = std::cbrt(o_rad(mt));
    x     = r * sin(phi) * cos(theta);
    y     = r * sin(phi) * sin(theta);
    z     = r * cos(phi);

    positions[i].x = current_pos.x + x_offset + x;
    positions[i].y = current_pos.y + y_offset + y;
    positions[i].z = current_pos.z + z_offset + z;
    positions[i].w = rain_size_factor * dist_w(mt);

    pos[idx++] = positions[i].x;
    pos[idx++] = positions[i].y;
    pos[idx++] = positions[i].z;
    pos[idx++] = positions[i].w;

    theta = 0.0f;
    phi   = 0.0f;
    x     = 0.0f;
    y     = 0.0f;
    z     = 0.0f;
}

m_buffers.position->unmap();

This is the Geometry setup:

m_sphere->setPrimitiveCount(m_spheresnum);
m_buffers.position = m_context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT4, m_spheresnum);
m_sphere->setIntersectionProgram(intersect);
m_sphere->setBoundingBoxProgram(bounds);

Finally, this is the method that sets up the GeometryGroup and gets the root Group:

void OptiXLidarInterface::CreateInstance()
{
    GeometryInstance gi = m_context->createGeometryInstance();
    gi->setMaterialCount(1);
    gi->setGeometry(m_sphere);
    gi->setMaterial(0, m_material);

    // geometry group
    m_geometrygroup = m_context->createGeometryGroup();
    m_geometrygroup->setChildCount(1);
    m_geometrygroup->setChild(0, gi);
    m_geometrygroup->setAcceleration(m_context->createAcceleration("Bvh", "Bvh"));

    m_group = vigOptiX::OptiXInterface::getSceneRoot();
    m_group->addChild(m_geometrygroup);
}

In the line where I set the acceleration, I have also tried Trbvh. As you can see I'm still using the old signature ( createAcceleration(const char* builder, const char* traverser) ) because, as I already mentioned, I cannot update OptiX due to several dependencies on other modules. Can you see any mistakes in this setup?

I cannot understand why the acceleration rebuild stops working correctly, considering that once the configuration is done at the beginning I don't change it anymore, and the algorithm works properly for a good number of frames at the beginning and then suddenly fails.

I'm using the bounding box and intersection programs as in the OptiX SDK.
This is the bounding box program:

RT_PROGRAM void Bounds(int primIdx, float result[6])
{
    const float3 cen = make_float3(spherepos[primIdx]);
    const float3 rad = make_float3(spherepos[primIdx].w);
    optix::Aabb* aabb = (optix::Aabb*)result;

    if (rad.x > 0.0f && !isinf(rad.x))
    {
        aabb->m_min = cen - rad;
        aabb->m_max = cen + rad;
    }
    else
    {
        aabb->invalidate();
    }
}

This all looks normal; just the code defining the buffer with the spheres on the Geometry is missing from the given code excerpts.

If there is a memory leak involved, I don't see it in the given code. You could try running the nvidia-smi command I posted here
[url]https://devtalk.nvidia.com/default/topic/1057063/optix/animations-in-optix-6-previously-performed-using-selectors-/post/5361054/#5361054[/url]
in parallel to see if the amount of VRAM used increases when your program starts getting sluggish.

I’m assuming the SetSpherePos() is debug code?

  • There is no need for the pos[clean_aux++] initializations. You overwrite those same values with pos[idx++] = positions[i].x; unconditionally.
  • Similarly for the reset of the theta, phi, x, y, z values at the end of the loop.
  • There would also be no need for the scalar assignments if you cast the mapped pointer to float4 (see the sketch after this list).
  • Is space_dist(mt) calculating the same value twice?
  • Is that type double? That would make the following trigonometric functions rather expensive.
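For illustration, a sketch of the float4 approach mentioned above (assuming the buffer was created with RT_FORMAT_FLOAT4 and m_spheresnum elements, as in your setup code):

optix::float4* pos = reinterpret_cast<optix::float4*>(m_buffers.position->map());
for (uint32_t i = 0; i < m_spheresnum; ++i)
{
    const float theta = 2.0f * static_cast<float>(M_PI) * space_dist(mt);
    const float phi   = acosf(1.0f - 2.0f * space_dist(mt));
    const float r     = std::cbrt(o_rad(mt));
    // One float4 assignment per sphere: xyz = center, w = radius.
    pos[i] = optix::make_float4(current_pos.x + x_offset + r * sinf(phi) * cosf(theta),
                                current_pos.y + y_offset + r * sinf(phi) * sinf(theta),
                                current_pos.z + z_offset + r * cosf(phi),
                                rain_size_factor * dist_w(mt));
}
m_buffers.position->unmap();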

Some of what you wrote also made me wonder whether your random number generator is giving you a good uniform distribution. Are you using the random number generator from our SDK, or something else? It might help to see how you generate the numbers and everything that happens between there and assigning the sphere positions and radii. In your code above, the radius is tested for negative or infinity; when, why and how often does that happen? Is that only there for completeness and safety and not normally needed, or does your program stop working correctly if you remove that test?


David.

Detlef:
Yes, the code I posted was for debug purposes; this is the current code:

float4 positions[m_spheresnum];
uint64_t seed = std::random_device{}() | std::chrono::system_clock::now().time_since_epoch().count();

std::mt19937 mt(seed);
std::gamma_distribution<float> dist_w(alpha, beta);
std::uniform_real_distribution<float> space_dist(0.0, 1.0);
std::uniform_real_distribution<float> o_rad(0.0, 8.0f); // origin-radius (2.0 meters max, cubic root of 8.0 meters)

float* pos = reinterpret_cast<float*>(m_buffers.position->map());

for (uint32_t i = 0, idx = 0; i < m_spheresnum; i++)
{
    theta = 2.0 * M_PI * space_dist(mt);
    phi   = acos(1.0 - 2.0 * space_dist(mt));
    r     = std::cbrt(o_rad(mt));
    x     = r * sin(phi) * cos(theta);
    y     = r * sin(phi) * sin(theta);
    z     = r * cos(phi);

    pos[idx++] = current_pos.x + x_offset + x;
    pos[idx++] = current_pos.y + y_offset + y;
    pos[idx++] = current_pos.z + z_offset + z;
    pos[idx++] = rain_size_factor * dist_w(mt);
}

m_buffers.position->unmap();
}

I know that space_dist is used twice; I use it to calculate two different values (theta & phi).

Data type of uniform distributions is float.

I'm measuring the GPU performance with nvidia-smi; these are the results when the simulation is working properly:

2019/07/15 14:25:45.863, Quadro M6000 24GB, P0, 24472 MiB, 3719 MiB, 3 %, 11 %
2019/07/15 14:25:46.864, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 3 %, 12 %
2019/07/15 14:25:47.865, Quadro M6000 24GB, P0, 24472 MiB, 3728 MiB, 3 %, 17 %
2019/07/15 14:25:48.866, Quadro M6000 24GB, P0, 24472 MiB, 3728 MiB, 1 %, 3 %
2019/07/15 14:25:49.867, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 3 %, 12 %
2019/07/15 14:25:50.868, Quadro M6000 24GB, P0, 24472 MiB, 3719 MiB, 3 %, 14 %
2019/07/15 14:25:51.869, Quadro M6000 24GB, P0, 24472 MiB, 3719 MiB, 2 %, 12 %
2019/07/15 14:25:52.870, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 1 %, 4 %
2019/07/15 14:25:53.872, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 2 %, 12 %
2019/07/15 14:25:54.873, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 2 %, 12 %
2019/07/15 14:25:55.874, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 2 %, 11 %

2019/07/15 14:26:21.903, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 8 %, 45 %
2019/07/15 14:26:22.904, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 8 %, 47 %
2019/07/15 14:26:23.906, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 8 %, 34 %
2019/07/15 14:26:24.907, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 5 %, 21 %
2019/07/15 14:26:25.908, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 7 %, 25 %
2019/07/15 14:26:26.909, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 4 %, 20 %
2019/07/15 14:26:27.910, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 6 %, 26 %
2019/07/15 14:26:28.911, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 2 %, 16 %
2019/07/15 14:26:29.913, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 5 %, 25 %
2019/07/15 14:26:30.914, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 1 %, 7 %
2019/07/15 14:26:31.927, Quadro M6000 24GB, P0, 24472 MiB, 3710 MiB, 5 %, 16 %

And these are the results when the issue is present:

2019/07/15 14:21:19.530, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 2 %, 7 %
2019/07/15 14:21:20.531, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 45 %
2019/07/15 14:21:58.575, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 44 %
2019/07/15 14:21:59.576, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 46 %
2019/07/15 14:22:00.578, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 2 %, 25 %
2019/07/15 14:22:01.579, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 5 %, 15 %
2019/07/15 14:22:02.583, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 44 %
2019/07/15 14:22:03.585, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 45 %
2019/07/15 14:22:04.586, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 2 %, 21 %


2019/07/15 14:22:45.631, Quadro M6000 24GB, P0, 24472 MiB, 3720 MiB, 11 %, 60 %
2019/07/15 14:22:46.632, Quadro M6000 24GB, P0, 24472 MiB, 3720 MiB, 11 %, 52 %
2019/07/15 14:22:47.633, Quadro M6000 24GB, P0, 24472 MiB, 3720 MiB, 10 %, 50 %
2019/07/15 14:22:48.635, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 10 %, 53 %
2019/07/15 14:22:49.636, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 50 %
2019/07/15 14:22:50.637, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 53 %
2019/07/15 14:22:51.638, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 53 %
2019/07/15 14:22:52.639, Quadro M6000 24GB, P0, 24472 MiB, 3711 MiB, 9 %, 48 %

As you can see, the GPU shows a considerable increase in workload when the issue appears. What do you think about these measurements? Again, this only occurs after a lot of frames have passed; once it starts, some frames still look great, but the issue comes back every two or three frames for the rest of the simulation.

Hi David,

You can see the random number generation in my previous comment. I'm using std::chrono to generate the seed and std::mt19937.

I took all the data of several frames where the issue was present and plotted each plane (xy, xz, yz), and the distribution looks fine; the issue seems to be inside OptiX.

Yes, I already use the radius > 0 and radius < inf checks; the result is the same.

The issue occurs in the same place in a given scenario, but it never follows the same pattern across scenarios. I should say that the complexity of the scenarios where I'm testing stays almost the same during the whole simulation (I mean that I don't suddenly add a lot of cars or pedestrians and then the issue appears; it can happen on an empty road, for example).