Performance hit whenever 'any hit' is used

I am relatively new to OptiX, and thus far I have been able to implement a basic ray tracer. One thing I can’t understand is why there is such a huge performance hit when ‘any hit’ is used. When I have only one ray type, which uses only a ‘closest hit’ program, everything runs great and I can render my Stanford bunny almost instantaneously. However, if I do anything related to an any hit program, I get about a 100x slowdown. For example, let’s say the only change that I make is to change

rtMaterialSetClosestHitProgram( *material, 0u, closest_hit_program );

to

rtMaterialSetAnyHitProgram( *material, 0u, any_hit_program );

where the any hit program is

RT_PROGRAM void any_hit()
{
  rtTerminateRay();
}

and of course add the associated program PTX file, etc. Just doing that slows my program down by about 100x (and obviously it doesn't render anything, since the any hit program doesn't write any shading result).
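
For completeness, the host-side setup looks roughly like this (a minimal sketch with placeholder variable, file, and function names; error checking omitted):

RTcontext  context;
RTmaterial material;
RTprogram  any_hit_program;

rtContextCreate( &context );
rtContextSetRayTypeCount( context, 1 );
rtMaterialCreate( context, &material );

/* Load the any hit program from its compiled PTX file (placeholder file and function names). */
rtProgramCreateFromPTXFile( context, "any_hit.ptx", "any_hit", &any_hit_program );

/* This single change (any hit instead of closest hit) is what triggers the slowdown. */
rtMaterialSetAnyHitProgram( material, 0u, any_hit_program );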

I’ve found that the same thing happens when I do the following: add two ray types instead of one

rtContextSetRayTypeCount( context, 2 );

then add only a closest hit program for ray type 0, and only an any hit program for ray type 1

rtMaterialSetClosestHitProgram( *material, 0u, closest_hit_program );
rtMaterialSetAnyHitProgram( *material, 1u, any_hit_program );

but then only launch rays of type 0. So again, no rays are being launched that have an any hit program (type 1). The only difference is that there is a second ray type with an any hit program that is seemingly unused. This still results in the same ~100x slowdown.
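
To make "only launch rays of type 0" concrete, my ray generation program is essentially the following (a simplified sketch with placeholder camera values and buffer names; the only relevant part is that the ray type index passed to make_Ray is always 0):

#include <optix.h>
#include <optixu/optixu_math_namespace.h>

using namespace optix;

struct PerRayData
{
  float3 result;
};

rtDeclareVariable( rtObject, top_object, , );
rtDeclareVariable( uint2, launch_index, rtLaunchIndex, );
rtBuffer<float4, 2> output_buffer;

RT_PROGRAM void ray_generation()
{
  PerRayData prd;
  prd.result = make_float3( 0.0f, 0.0f, 0.0f );

  // Placeholder camera setup.
  float3 origin    = make_float3( 0.0f, 0.0f, 5.0f );
  float3 direction = make_float3( 0.0f, 0.0f, -1.0f );

  // Ray type 0 only, so only the closest hit program should ever be invoked.
  Ray ray = make_Ray( origin, direction, 0, 0.0f, RT_DEFAULT_MAX );
  rtTrace( top_object, ray, prd );

  output_buffer[launch_index] = make_float4( prd.result, 1.0f );
}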

Is there something I’m missing with the any hit programs? Given the example above, it seems like using the any hit program should be faster, since it kills the ray as soon as it intersects anything.

Hard to say with the given information.

Please always specify the following system information when reporting problems, so that this can be verified on a matching system:
OS version, installed GPU(s), display driver version, OptiX version, and the CUDA Toolkit version you used to compile your PTX programs.
Performance issues require absolute numbers and a description of how to reproduce your results.

Is that a release or a debug build?
Are there any rtPrintf outputs or exceptions when running the app?
Enable all exceptions to see that.
If not, disable all exceptions and rtPrintf when benchmarking.
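
With the C API that would be something like this (just a sketch; adapt it to your own context setup):

/* While debugging: enable rtPrintf output and all exceptions. */
rtContextSetPrintEnabled( context, 1 );
rtContextSetExceptionEnabled( context, RT_EXCEPTION_ALL, 1 );

/* When benchmarking: switch both off again. */
rtContextSetPrintEnabled( context, 0 );
rtContextSetExceptionEnabled( context, RT_EXCEPTION_ALL, 0 );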

Additional test:
What is the performance you see when running some pre-compiled OptiX SDK examples like the simple Whitted style manta_scene example?
I think key ‘r’ does continuous rendering and ‘b’ does benchmarking.

Are you measuring with vsync turned off in the NVIDIA Display Control Panel?