Excessive setup times

Hi all,

I’m encountering excessive setup times when baking per-vertex AO.
Basically, I have a camera program that reads a vertex buffer and a normal buffer to get the position and normal of each vertex.
The output is a 1D buffer with one entry per vertex.

Setup time is measured by setting up the pipeline and then calling launch with a launch size of 0.

I expected a large setup time for the first render, and I’m seeing one (187s).
However, a subsequent render still takes ~42s to set up.

Here’s the output:

Dummy launch took 187.112717s
Launch took 0.118060s for 270 vertices

Dummy launch took 41.918648s
Launch took 0.095409s for 270 vertices

Notice that the actual render is really fast, which is the reason I’m using OptiX for this instead of a “cube map per vertex” style AO render, but those setup times are killing me.

Here’s the code:

void StaticLightOptix::RenderVertexAO(CMeshRenderEntity* pMeshInst)
{

	FlushD3D();

	// grab the mesh ptr
	CMesh const * pMesh = pMeshInst->GetMeshPtr(0);
	uint32 numV = pMesh->GetMeshBuffers().mVertexCount;
	uint32 totalV = pMeshInst->CountTotalVertices();

	// create and bind an output buffer
	optix::Buffer buffer = mOptixContext->createBuffer(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT, totalV);
	mOptixContext["outAOBuffer"]->setBuffer(buffer);

	{
		// bind mesh parameters
		optix::Geometry geo = AddMesh(pMesh);

		mOptixContext["vertex_buffer"]->setBuffer(geo["vertex_buffer"]->getBuffer());
		mOptixContext["normal_buffer"]->setBuffer(geo["normal_buffer"]->getBuffer());
	}

	{
		// setup parameters
		mOptixContext["sqrt_occlusion_samples"]->setInt(6);
		mOptixContext["occlusion_distance"]->setFloat(4.f);

		// grab the transform for this mesh
		const CMatrix34* pTransform = pMeshInst->GetTransformPtr();
		mOptixContext["matrix_row_0"]->setFloat(pTransform->Get00(), pTransform->Get10(), pTransform->Get20(), pTransform->Get30());
		mOptixContext["matrix_row_1"]->setFloat(pTransform->Get01(), pTransform->Get11(), pTransform->Get21(), pTransform->Get31());
		mOptixContext["matrix_row_2"]->setFloat(pTransform->Get02(), pTransform->Get12(), pTransform->Get22(), pTransform->Get32());
	}

	try
	{
		{
			bpe::StopWatch elapsedTime;
			// dummy launch to trigger the initial setup/compile
			mOptixContext->launch(kCamProg_VertexRelative, 0);

			char msg[256];
			sprintf(msg, "Dummy launch took %fs\r\n", elapsedTime.GetElapsedSeconds());
			OutputDebugStringA(msg);
		}


		{
			bpe::StopWatch elapsedTime;
			mOptixContext->launch(kCamProg_VertexRelative, numV);

			char msg[256];
			sprintf(msg, "Launch took %fs for %u vertices\r\n", elapsedTime.GetElapsedSeconds(), numV);
			OutputDebugStringA(msg);
		}

		// now copy the output buffer into the mesh AO buffer
		{
			float* pData = (float*)buffer->map();

			// invert!!
			for (uint32 i = 0; i < totalV; i++)
			{
				pData[i] = 1.0f - pData[i];
			}
			pMeshInst->CreatePerVertexData(EMeshInstanceVertexDataType::kBakedAO, CBuffer::EFormat::Float, totalV, pData);
			buffer->unmap();
		}

	}
	catch (optix::Exception& e)
	{
		CheckError(e.getErrorCode());
	}

	// unbind stuff
	mOptixContext->removeVariable(mOptixContext["vertex_buffer"]);
	mOptixContext->removeVariable(mOptixContext["normal_buffer"]);
	mOptixContext->removeVariable(mOptixContext["outAOBuffer"]);
}

Note that the FlushD3D function merely ensures the D3D immediate context is flushed.
I call RenderVertexAO sequentially in a loop with no other functions between calls.

Any help would be appreciated.

– Martin

I got the between-instance setup time down to ~9s by not removing the variables.
Still way too long.

Also, I now create a single outAOBuffer that’s large enough for the largest mesh and assign it once, but that didn’t seem to make a difference.
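
Something like this at init time (simplified; mSharedAOBuffer and mMaxVertexCount stand in for the shared buffer and the largest vertex count across all meshes):

// Created once, sized for the biggest mesh, bound once.
mSharedAOBuffer = mOptixContext->createBuffer(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT, mMaxVertexCount);
mOptixContext["outAOBuffer"]->setBuffer(mSharedAOBuffer);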

I’m wondering if the silly startup times are due to having the debugger attached.
I noticed that the samples take much longer to start when the debugger is attached, even in “release” mode.

– Martin

Further information:

I hit a run of instances of the same mesh all in a row, and the dummy launch between those instances took ~0.02s.
When the run ended and a different mesh came up, the dummy launch rose to the ~9s mark again.

Looks like something nasty is going on if you change a buffer in the context?

– Martin

First, adding or removing variables triggers a recompile of the mega-kernel. It’s basically a code change, which means OptiX recompiles the PTX code, and that is an expensive operation.
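
The cheap pattern is to declare every variable once before the first launch and afterwards only update values. A rough sketch, assuming hypothetical helpers on your class and reusing your variable names:

void StaticLightOptix::InitAOVariables() // hypothetical init-time helper
{
	// Setting a variable for the first time adds it to the context, so do
	// this once up front; the variable set never changes afterwards.
	mOptixContext["sqrt_occlusion_samples"]->setInt(6);
	mOptixContext["occlusion_distance"]->setFloat(4.f);
}

void StaticLightOptix::BindMeshBuffers(optix::Geometry geo) // hypothetical per-mesh helper
{
	// Re-pointing existing variables at other buffers changes no code,
	// so it should not trigger a mega-kernel recompile.
	mOptixContext["vertex_buffer"]->setBuffer(geo["vertex_buffer"]->getBuffer());
	mOptixContext["normal_buffer"]->setBuffer(geo["normal_buffer"]->getBuffer());
}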

You might also want to check OptiX 3.9.0, which is the current stable release.
OptiX 4.0 beta 1 is a complete rewrite of the core and has a different runtime behaviour at this time.

Some questions about your StaticLightOptix::RenderVertexAO() function:

How often are you calling RenderVertexAO() in your application?

If RenderVertexAO() is called more than once, you’re leaking OptiX objects!
Note that the OptiX C++ wrappers do not generate a fully reference-counted shadow graph. They just provide the C API as C++ objects. The memory management still needs to be done explicitly! You need to call destroy() on created objects when they should be freed.
In your case buffer and geo, for example. removeVariable() does not destroy them either.
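
A minimal cleanup sketch for the end of RenderVertexAO() (whether geo can be destroyed there depends on what AddMesh() keeps alive):

// Explicitly release the per-call objects once they are no longer needed.
buffer->destroy(); // the per-call output buffer
geo->destroy();    // the geometry created by AddMesh()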

What exactly happens in AddMesh(pMesh)?

The inversion pData[i] = 1.0f - pData[i]; is a waste of time. (Even more so in the debugger.)
This code runs sequentially on the CPU. You should do this on the GPU so it runs in parallel, either when creating the data in OptiX (recommended) or when reading it in HLSL.

That would also get rid of your illegal write access to buffer data on the host! Your buffer is defined as RT_BUFFER_OUTPUT. You should not access it on the host for writing, only for reading!
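
A rough sketch of doing the inversion inside the ray generation program instead (the program name and the occlusion helper are placeholders, not your actual code):

#include <optix_world.h>

rtBuffer<float, 1> outAOBuffer;
rtDeclareVariable(unsigned int, launch_index, rtLaunchIndex, );

// Assumed device helper standing in for the existing AO sampling loop.
static __device__ float ComputeOcclusionForVertex(unsigned int idx)
{
	return 0.0f; // placeholder; the real code traces occlusion rays here
}

RT_PROGRAM void vertex_ao_camera() // hypothetical entry point name
{
	// Write the already-inverted value; the host then only reads the buffer.
	outAOBuffer[launch_index] = 1.0f - ComputeOcclusionForVertex(launch_index);
}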

Also, if that buffer resides on the GPU, accessing it from the host is a potentially slow operation (data access via the PCI-E bus).
For these cases there are interop buffers; you should use OptiX 3.9.0 if you need D3D interop, and add your use case to the post https://devtalk.nvidia.com/default/topic/929357/optix/optix-4-0-beta-and-directx-interop/

I assume pMeshInst->CreatePerVertexData() is making a copy of pData? (I’m not using D3D myself.)
The pData pointer is invalid after the buffer->unmap().

Now the other performance impacting case:
If you change the geometry inside the scene, the acceleration structures (AS) need to be rebuilt.
Depending on the AS builder used and the complexity of the geometry, this can be a more or less expensive operation.
What is the complexity of your scene?
Normally Trbvh is very fast for building and rendering.
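
Selecting it is a one-liner (a sketch; the GeometryGroup variable is assumed):

// Trbvh builds quickly and traces fast; "Bvh" is the matching traverser.
optix::Acceleration accel = mOptixContext->createAcceleration("Trbvh", "Bvh");
geometryGroup->setAcceleration(accel);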

When reporting problems please always report the following system information:
OS version, installed GPU(s), display driver version, OptiX version, CUDA toolkit version used to compile PTX code for OptiX.
This reduces turnaround times and allows for more precise recommendations.

Hi Detlef,

Good to know about the leaks… should be simple to clean those up.

It seems that the majority of the delay comes from having the debugger attached.
If it is attached, then any non-trivial change to the system takes ~8s.
For instance, a simple shader to clear the accumulation buffer takes 8s, then the first render takes 8s, then subsequent renders with simple changes take basically nothing, then the shader to read the result takes 8s.

I don’t know if this is a known issue, or something affected by our setup (we run from a C# front end via a managed C++ wrapper to the game-level code).

– Martin

“For instance, a simple shader to clear the accumulation buffer takes 8s, then the first render takes 8s, then subsequent renders with simple changes take basically nothing, then the shader to read the result takes 8s.”

Not sure what you mean by the different shaders, but if you exchange ray generation programs, for example, that will of course trigger a kernel recompile.
The fast method for that case is to define multiple entry points with these different ray generation programs all at once, and then pick which one to use with the first parameter of the OptiX context->launch() call. That way the mega-kernel doesn’t change, and switching entry points has no additional cost.
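
Roughly, using the names from your code (the PTX file and program names are placeholders):

// Register all ray generation programs up front, once, at init time.
mOptixContext->setEntryPointCount(3);
mOptixContext->setRayGenerationProgram(0, mOptixContext->createProgramFromPTXFile("clear.ptx", "clear_accum"));
mOptixContext->setRayGenerationProgram(1, mOptixContext->createProgramFromPTXFile("render.ptx", "vertex_ao_camera"));
mOptixContext->setRayGenerationProgram(2, mOptixContext->createProgramFromPTXFile("read.ptx", "read_result"));

// Later, switching entry points is just a different first argument to launch();
// all of them are already part of the mega-kernel, so nothing recompiles.
mOptixContext->launch(0, numV); // clear
mOptixContext->launch(1, numV); // render
mOptixContext->launch(2, numV); // read back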

If you think this is due to the attached debugger, how does it perform as a release build without the debugger attached?

I have no recent experience with C#. The last time I debugged an OpenGL wrapper using it, it was completely CPU-bound by needlessly copying image data back and forth. That said, I could believe that managed code under a debugger is part of your problem.

In any case, benchmarks should always be run in release mode without a debugger attached.