OptiX performance loss due to RTX execution strategy?

Hi,
I’ve run into the same trouble as mentioned in this topic: https://devtalk.nvidia.com/default/topic/1052340/optix/optix-6-0-0-performance-loss-/post/5356330/

The situation is similar, but I tried running more examples. Accounting for the different number of samples used in the various versions, the optixPathTracer sample in the OptiX 6.5/7.0 SDK is about 2x slower than in OptiX 3.9/5.1.0.

For the ocean sample in optix_advanced_samples, I got ~25 fps built with OptiX 5.1 (CUDA 9.0) and only ~0.5 fps built with OptiX 6.5 (CUDA 10.1).

I also noticed the same phenomenon mentioned in that topic: in the GPU performance tab of Task Manager, “Copy” is near 100% with OptiX 5.1.0/3.9 but 0% with OptiX 6.5/7.0.

I ran a performance analysis and found something that may be related (however, I don’t have a 10-series or newer graphics card, so next-gen CUDA debugging isn’t available to me):
cuMemcpyDtoHAsync_v2 takes most of the time in OptiX 5.1 (while cuEventSynchronize spans a long time in OptiX 6.5).
I can also see Megakernel_CUDA_x (x goes from 0 to 4) taking quite a bit of time in OptiX 5.1.

So I tried disabling the default RTX execution strategy in OptiX 6.5 using:

// Minimal error check for OptiX calls made before a context exists.
#define RT_CHECK_ERROR_NO_CONTEXT( func ) \
  do { \
    RTresult code = func; \
    if (code != RT_SUCCESS) \
      std::cerr << "ERROR: Function " << #func << std::endl; \
  } while (0)

// Turn off the RTX execution strategy; this must happen before the context is created.
int enableRTX = 0;
RT_CHECK_ERROR_NO_CONTEXT(rtGlobalSetAttribute(RT_GLOBAL_ATTRIBUTE_ENABLE_RTX, sizeof(enableRTX), &enableRTX));

And it turns out the ocean sample then runs at ~25 fps (as fast as in OptiX 5.1!). I suspect there could be an issue here.

By the way, I can’t find a way to attach images or files (such as an .nvreport) to the thread. I wonder if that would help.

Environment:
Windows 10, x64
Windows 10 SDK: 10.0.18362.0
NVIDIA GTX 960M
Driver version: 436.02
(CUDA and OptiX versions mentioned above)

Welcome @hearwindsaying,

The optixOcean sample from the OptiX advanced samples is simply out of date code. We’re in the process of updating for OptiX 7. The structure of this code is written in a way that causes the program to fall off the performance cliff with today’s OptiX SDK & driver. It’s not a problem with OptiX, it’s a problem with this particular application. If we were to re-write it today, we would use a triangle mesh and BVH rebuilds instead of the ray-marching approach. As you can see in the thread you linked to, @mahonyyy was able to recover most of the performance by rewriting the intersection loop.
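
Just to sketch what I mean, a rewritten version would look something along these lines with the OptiX 6 GeometryTriangles API. This is an untested sketch, not the actual optixOcean update; the buffer sizes, the material, and the per-frame height-field update are placeholders:

#include <optixu/optixpp_namespace.h>

// Untested sketch: deform a triangle mesh each frame and rebuild its BVH,
// instead of ray-marching the height field in an intersection program.
void buildOceanMesh( optix::Context context, optix::Material material,
                     unsigned int num_vertices, unsigned int num_triangles )
{
    optix::Buffer vertices = context->createBuffer( RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, num_vertices );
    optix::Buffer indices  = context->createBuffer( RT_BUFFER_INPUT, RT_FORMAT_UNSIGNED_INT3, num_triangles );

    optix::GeometryTriangles tris = context->createGeometryTriangles();
    tris->setPrimitiveCount( num_triangles );
    tris->setVertices( num_vertices, vertices, RT_FORMAT_FLOAT3 );
    tris->setTriangleIndices( indices, RT_FORMAT_UNSIGNED_INT3 );

    optix::GeometryInstance gi = context->createGeometryInstance();
    gi->setGeometryTriangles( tris );
    gi->setMaterialCount( 1 );
    gi->setMaterial( 0, material );

    optix::GeometryGroup group = context->createGeometryGroup();
    group->addChild( gi );
    group->setAcceleration( context->createAcceleration( "Trbvh" ) );
    context["top_object"]->set( group );
}

// Per frame: write the new wave heights into the vertex buffer (e.g. with a
// CUDA kernel or by mapping the buffer), mark the acceleration dirty so the
// BVH is rebuilt on the next launch, then launch:
//   group->getAcceleration()->markDirty();
//   context->launch( 0, width, height );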

As far as the optixPathTracer goes, you need to compare the exact same application code. The OptiX SDK samples are not benchmarks; they can and do change across releases. From your description, I can’t tell which versions you’re testing. At some point we modified the samples per pixel in the optixPathTracer, as well as the window size, so you might be measuring two very different workloads.

In order to compare apples to apples, you’ll need to compile a single version of an app like optixPathTracer, and be prepared to find and copy the version of the optix & rtcore DLLs you want to test so they’re next to your application binary. You can verify the binary is getting the right DLLs using a utility like Microsoft’s Process Explorer.
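
If you’d rather verify from inside the application than with Process Explorer, something along these lines works on Windows. It’s just a sketch; the DLL file name is an assumption and depends on which OptiX version you’re loading:

#include <windows.h>
#include <iostream>

// Print the full path of a DLL that is already loaded into this process.
// Call it after creating the OptiX context, passing the file name of the
// DLL you expect for the SDK under test (e.g. "optix.6.5.0.dll").
void printLoadedModulePath( const char* dllName )
{
    HMODULE mod = GetModuleHandleA( dllName );
    if( mod )
    {
        char path[MAX_PATH] = {};
        GetModuleFileNameA( mod, path, MAX_PATH );
        std::cout << dllName << " loaded from: " << path << std::endl;
    }
    else
    {
        std::cout << dllName << " is not loaded (yet)" << std::endl;
    }
}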


David.

Hi dhart,

Thanks for your quick reply and explanation.

It seems that some of the precompiled samples in OptiX 6.0 and above may not be written optimally, for example not disabling any-hit when appropriate (I’ve done so in my renderer).
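
For reference, the idea in my renderer is roughly the sketch below (simplified OptiX 6 device code; top_object, the payload struct, and the dummy origin/direction are placeholders, not my actual code). The extra visibility-mask/flags arguments of rtTrace skip any-hit programs for rays that don’t need them:

#include <optix_world.h>

rtDeclareVariable( rtObject, top_object, , );

struct PerRayData
{
    optix::float3 result;
};

RT_PROGRAM void raygen_sketch()
{
    // Dummy origin/direction, just to keep the sketch self-contained.
    const optix::float3 origin    = optix::make_float3( 0.0f, 0.0f, 0.0f );
    const optix::float3 direction = optix::make_float3( 0.0f, 0.0f, 1.0f );

    optix::Ray ray = optix::make_Ray( origin, direction, /*ray type*/ 0, 0.0f, RT_DEFAULT_MAX );

    PerRayData prd;
    prd.result = optix::make_float3( 0.0f );

    // The visibility-mask/flags overload (OptiX 6+) skips any-hit programs
    // entirely for this trace; use it when the geometry is opaque to this ray type.
    rtTrace( top_object, ray, prd, RT_VISIBILITY_ALL, RT_RAY_FLAG_DISABLE_ANYHIT );
}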

For the optixPathTracer, I set sqrt_num_samples = 2, width = 512 and height = 512 in both OptiX 5.1.0 and OptiX 6.5.0.
I also checked with Process Explorer to make sure each build loads the corresponding DLLs. I’ll look through the code to make sure I didn’t change any parts of it by accident.

Here are some more results from running the precompiled samples directly:

(same launch width and height)
sample/version/fps
optixParticles/5.1/~170
optixParticles/6.5/~150
optixWhitted/5.1/~35
optixWhitted/6.5/~30
optixWhitted/7.0/~30 (built using CUDA10.1)

Okay, I know they’re not meant as benchmarks and I’m not here to complain (in fact, another project of mine benefits from RTX hardware acceleration after upgrading to OptiX 6.5).

Thanks!

I just found that in the OptiX 5.1 path tracer sample, Russian roulette is used and the minimum depth is initially set to 1, while in OptiX 7.0 the maximum depth is set to 3 and no Russian roulette is employed. That makes a difference. I’ll look into the code for more details.
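
Just to make the difference in workload concrete, the two termination strategies boil down to loops like this toy sketch (standalone C++, not the SDK code; shade_bounce and rnd01 are made-up stand-ins for the real shading and RNG):

#include <cmath>
#include <cstdlib>

// Fake helpers, only to keep the sketch self-contained.
static float rnd01()                      { return std::rand() / ( RAND_MAX + 1.0f ); }
static float shade_bounce( float& thrpt ) { thrpt *= 0.8f; return 0.1f * thrpt; }

// OptiX 5.1-style path loop: unbounded depth, Russian roulette after rr_begin_depth.
static float radiance_with_russian_roulette( int rr_begin_depth /* == 1 */ )
{
    float radiance = 0.0f, throughput = 1.0f;
    for( int depth = 0; ; ++depth )
    {
        radiance += shade_bounce( throughput );
        if( depth >= rr_begin_depth )
        {
            const float p = std::fmin( throughput, 1.0f );  // survival probability
            if( rnd01() >= p )
                break;               // kill the path with probability 1 - p
            throughput /= p;         // re-weight survivors to keep the estimate unbiased
        }
    }
    return radiance;
}

// OptiX 7-style path loop: no Russian roulette, hard cutoff at max_depth.
static float radiance_with_max_depth( int max_depth /* == 3 */ )
{
    float radiance = 0.0f, throughput = 1.0f;
    for( int depth = 0; depth < max_depth; ++depth )
        radiance += shade_bounce( throughput );
    return radiance;
}

So even with identical samples per pixel, the average number of rays traced per sample can differ quite a bit between the two versions.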

If you want to compare the performance of the SDKs, I’d recommend not even trying to compare the two versions of the optixPathTracer code. Pick one of them and compile it against both SDKs. There are lots of small details that affect performance, so it’s best to ensure that your application code is 100% identical. Even better, compile multiple samples against both APIs to get a broader sense of the overall differences. The optixOcean sample isn’t the only one with an older structure that would benefit from a rewrite for OptiX 6/7.

It would also be a good idea to consider using an RTX GPU if at all possible. I hope it goes without saying that the best performance you can get on OptiX 6/7 is with RTX hardware, and that our current and future efforts are primarily focused on how to best utilize the RT Cores in hardware.


David.

Well, I tried compiling optixPathTracer with OptiX 5.1 (CUDA 9.0) and OptiX 6.5 (CUDA 10.1) respectively.

In the createContext() function I commented out “context->setMaxTraceDepth( 2 );” so that it also compiles with OptiX 5.1 (an alternative guard is sketched after the snippet).

context = Context::create();
context->setRayTypeCount( 2 );
context->setEntryPointCount( 1 );
context->setStackSize( 1800 );
//context->setMaxTraceDepth( 2 );
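
(As an untested alternative, I suppose I could have kept the call and guarded it instead, assuming optix.h defines OPTIX_VERSION in both SDKs:)

#if OPTIX_VERSION >= 60000
    // setMaxTraceDepth() only exists in OptiX 6.x, so skip it when building against 5.1.
    context->setMaxTraceDepth( 2 );
#endif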

I got ~45 fps with the OptiX 5.1 build and ~28 fps with the OptiX 6.5 build. They should be running the same code now.

However, when I disable the RTX execution strategy in the OptiX 6.5 build with:

#define RT_CHECK_ERROR_NO_CONTEXT( func ) \
  do { \
    RTresult code = func; \
    if (code != RT_SUCCESS) \
      std::cerr << "ERROR: Function " << #func << std::endl; \
  } while (0)

int enableRTX = 0;
RT_CHECK_ERROR_NO_CONTEXT(rtGlobalSetAttribute(RT_GLOBAL_ATTRIBUTE_ENABLE_RTX, sizeof(enableRTX), &enableRTX));

the OptiX 6.5 build also runs at ~45 fps. I wonder whether that makes sense.