OptiX, OptiX Prime, Compatibility with CPU and RTX

Hi,

I’m working on a new test renderer, and would like to support CPU raytracing via Embree and GPU raytracing via OptiX or OptiX Prime. I’d like some advice on which GPU API to select. In particular, I have three main concerns:

  • I do not want to maintain multiple code paths. A bit of conditional compilation is of course okay, but the bulk of the renderer should only be written once, yet support both CPU and GPU.
  • The compilation model should be fairly simple. I am displeased with the way OptiX programs apparently must load their own code as compiled PTX explicitly at runtime, and can only be compiled at all using enormous CMake scripts which we're just supposed to modify. It seems like OptiX Prime works with the same simpler compilation model as CUDA (write, compile/link, done)?
  • The GPU side should support NVIDIA's emerging RTX technology. While OptiX is one of the three listed ways to access RTX (the other two being a notional Vulkan API still to come, and DXR), it's unclear to me whether OptiX Prime is an option here.
  • So overall, if OptiX Prime will also support RTX, it seems like I should select that? If so, can someone clarify the complexity difference between the two APIs? I don’t mind getting my hands dirty with ray packets and such, but I’d like to know what I’m getting into, including any performance pitfalls I might not run into with OptiX, yet I haven’t found any simple or up-to-date OptiX Prime samples at all.

    Ian

    “the renderer should only be written once, yet support both CPU and GPU.”
    OptiX runs purely on the GPU (it requires CUDA). OptiX Prime has a CPU fallback.
    There is no CPU fallback for denoising (see OptiX_Programming_Guide_5.1.0, page 75).

    “am displeased with the way OptiX programs apparently must load their own code as compiled PTX explicitly at runtime, and can only be compiled at all using enormous CMake scripts which we’re just supposed to modify”
    You can compile them without the CMake scripts. The PTX files are compiled offline with NVCC, or JIT-compiled at run time by NVRTC.
    Look into the “OptiX Introduction Samples”:
    [url]https://github.com/nvpro-samples/optix_advanced_samples/tree/master/src/optixIntroduction[/url]
    When you set up the project with CMake, the samples later use the “Custom Build Tool” in VS2017.
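
    If you go the NVRTC route, a minimal sketch of the run-time compilation step could look like this (plain C++ host code of my own; the option list is a placeholder and error handling is reduced to exceptions):
    [code]
    // JIT-compile a CUDA source string to PTX at run time with NVRTC.
    // A real version would also print the compile log via nvrtcGetProgramLog().
    #include <nvrtc.h>
    #include <stdexcept>
    #include <string>

    std::string compileToPTX(const char* cudaSource, const char* name)
    {
        nvrtcProgram prog = nullptr;
        if (nvrtcCreateProgram(&prog, cudaSource, name, 0, nullptr, nullptr) != NVRTC_SUCCESS)
            throw std::runtime_error("nvrtcCreateProgram failed");

        // Real options would include -I paths to the CUDA and OptiX headers.
        const char* options[] = { "--use_fast_math" };
        if (nvrtcCompileProgram(prog, 1, options) != NVRTC_SUCCESS)
            throw std::runtime_error("NVRTC compilation failed");

        size_t ptxSize = 0;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::string ptx(ptxSize, '\0');
        nvrtcGetPTX(prog, &ptx[0]);

        nvrtcDestroyProgram(&prog);
        return ptx;  // hand this string to the OptiX API
    }
    [/code]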

    “The GPU side should support NVIDIA’s emerging RTX technology.”
    For RTX you need a Volta GPU or newer.
    [url]https://devblogs.nvidia.com/introduction-nvidia-rtx-directx-ray-tracing/[/url]
    [url]https://developer.nvidia.com/rtx[/url]
    There is a D3D12 Raytracing Fallback Layer (see [url]https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Libraries/D3D12RaytracingFallback[/url]) which emulates the DirectX Raytracing API on devices without native driver/hardware support.

    “, yet I haven’t found any simple or up-to-date OptiX Prime samples at all.”
    There are OptiX Prime samples in the OptiX 5.1 SDK: [url]https://developer.nvidia.com/designworks/optix/download[/url]
    primeSimple, primeSimplePP, primeInstancing, primeMasking, primeMultiBuffering, primeMultiGpu
    There are of course also simple samples for OptiX: optixHello, optixSphere, …
    And there is an excellent tutorial (also with some very simple examples and some advanced ones) by Detlef Roettger for OptiX 5.1 [url]https://github.com/nvpro-samples/optix_advanced_samples/tree/master/src/optixIntroduction[/url]

    “can one clarify the complexity difference in the two APIs?”
    OptiX Prime is a fast, low-level API for ray tracing.
    OptiX requires a CUDA-capable GPU and the CUDA toolkit. Look at the simple samples in the OptiX 5.1 SDK.

    The point is that, in keeping with my first concern, there should be only one code path. OptiX Prime having a fallback to the CPU is nice, but we’ll probably use Embree for that instead.

    Yes; the OptiX Introduction Samples ultimately depend on all the scripts in https://github.com/nvpro-samples/optix_advanced_samples/tree/master/src/CMake, which is . . . nontrivial. But what I’m asking is whether the OptiX Prime compilation model is simpler? It seems to be: OptiX has a compile stage and then a load-PTX-at-runtime stage, whereas OptiX Prime is just compile and link, but I’m not sure I’ve surmised that correctly.

    My question was whether OptiX Prime will be accelerated by NVIDIA RTX, in the same way that OptiX is. Currently, only OptiX, Microsoft DXR, and Vulkan are listed as the ways to get RTX acceleration. My question is whether OptiX Prime counts under the umbrella of OptiX for this purpose.

    Yes; okay that’s pretty obvious. Sorry for not checking that first.

    “But what I’m asking is whether the OptiX Prime compilation model is simpler? It seems to be: OptiX has a compile stage and then a load-PTX-at-runtime stage, whereas OptiX Prime is just compile and link, but I’m not sure I’ve surmised that correctly.”

    With OptiX Prime you write pure .cu CUDA kernels, which can be compiled to .OBJ files by NVCC (using nvcc.exe --compile …).
    Those can then simply be linked into the project as you do with any other CUDA kernel.
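
    For illustration, such a ray-generation kernel might look like the sketch below (my own example, not from the SDK; it assumes the 8-float origin/tmin/direction/tmax ray layout that Prime’s RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX format expects, and a placeholder pinhole camera):
    [code]
    // raygen.cu -- compile offline with "nvcc --compile raygen.cu" and link the
    // resulting object file like any other CUDA kernel. Fills a device buffer of
    // primary rays in an origin/tmin/direction/tmax layout for OptiX Prime.
    struct PrimeRay { float3 origin; float tmin; float3 dir; float tmax; };  // 8 floats

    __global__ void generatePrimaryRays(PrimeRay* rays, int width, int height,
                                        float3 eye, float3 U, float3 V, float3 W)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        // Map the pixel to [-1,1]^2 and build a pinhole-camera direction.
        float dx = (x + 0.5f) / width  * 2.0f - 1.0f;
        float dy = (y + 0.5f) / height * 2.0f - 1.0f;
        float3 d = make_float3(dx * U.x + dy * V.x + W.x,
                               dx * U.y + dy * V.y + W.y,
                               dx * U.z + dy * V.z + W.z);
        float invLen = rsqrtf(d.x * d.x + d.y * d.y + d.z * d.z);

        PrimeRay r;
        r.origin = eye;
        r.tmin   = 0.0f;
        r.dir    = make_float3(d.x * invLen, d.y * invLen, d.z * invLen);
        r.tmax   = 1e34f;
        rays[y * width + x] = r;
    }
    [/code]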

    For the OptiX API this is not possible, but with the Custom Build Tool you can generate the PTX code for the OptiX API as strings in .h files (C++), which can then be included directly in the C++ module without external .PTX files.
    The OptiX API needs PTX source code as input. See https://devtalk.nvidia.com/default/topic/1027289/optix/creating-an-rtprogram-from-cuda-obj-file-/
    As I said before, there is the JIT compiler NVRTC, which can compile the .cu files at application run time. You lose this option when you embed the PTX code as strings in a .h include file.
    If you look into the sampleConfig.h file of the OptiX 5.1 SDK, you can set CUDA_NVRTC_ENABLED according to your needs.

    So if you don’t need compilation at run time, you can link the results into the module in both cases, but for the OptiX API only as PTX code strings.
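
    For example, loading such an embedded PTX string with the OptiX 5.x C++ wrapper looks roughly like this (the header name raygen_ptx.h and the program name "raygen" are placeholders for whatever your build step generates):
    [code]
    // Create an OptiX program from a PTX string that was baked into the executable.
    #include <optixu/optixpp_namespace.h>
    #include "raygen_ptx.h"  // generated header defining: const char raygen_ptx[] = "...";

    void setupRayGen(optix::Context context)
    {
        // The second argument is the name of the RT_PROGRAM inside the original .cu file.
        optix::Program rayGen = context->createProgramFromPTXString(raygen_ptx, "raygen");
        context->setRayGenerationProgram(0, rayGen);
    }
    [/code]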

    And OptiX Prime can only process triangles as primitives.

    From the OptiX_Programming_Guide_5.1.0, page 93:
    […] Sometimes the algorithm as a whole does not benefit from this tight coupling of user code and ray tracing code, and only the ray tracing functionality is needed. Visibility, trivial ray casting rendering, and ray tracing very large batches of rays in phases may have this property. OptiX Prime is a set of OptiX APIs designed for these use cases. Prime is specialized to deliver high performance for intersecting a set of rays against a set of triangles. Prime is a thinner, simpler API, since programmable operations, such as shading, are excluded. Prime is also suitable for some quick experimentation and hobby projects. […]

    “Yes; the OptiX Introduction Samples ultimately depend on all the scripts in https://github.com/nvpro-samples/optix_advanced_samples/tree/master/src/CMake, which is . . . nontrivial”
    But CMake and VS2017 do that for you.

    “My question is whether OptiX Prime counts under the umbrella of OptiX for this purpose.”
    Sorry, I just don’t know.

    “My question is whether OptiX Prime counts under the umbrella of OptiX for this purpose.”
    No. We have no plans to support RTX acceleration with OptiX Prime.

    So OptiX Prime is no longer supported and is effectively a deprecated API with regard to RTX and new features?

    @Ankit_Patel:
    I would very much like an answer to this as well. Information has been scarce, and it is hard to prepare for this new platform. To prepare for RTX, I preordered a device, waited for CUDA 10, then changed from driver API coding to runtime API coding to be able to use OptiX, and now I read that OptiX Prime is not going to support RTX? Some support would be very welcome!

    @jbikker: to summarize:

    OptiX does support RTX.
    OptiX Prime does not.

    AFAIK, OptiX Prime basically just does ray intersection; there is no facility to generate more rays from hit points. Talking to some other folks from NVIDIA out of band, the claim was that adding RTX acceleration to OptiX Prime would therefore be pointless, since without the ability to generate rays on-chip, the system is limited by memory bandwidth.

    Ian: thanks for your reply.
    However, I don’t see how Prime would limit RTX, on the contrary actually. The way I use Prime, I feed it buffers of rays which I generate on the device, and these never leave the device. Hits generated by Prime also stay on the device and become input for another CUDA kernel.
    RTX can only work with triangles, so the generic OptiX can at best only partially benefit from hardware BVH traversal. I understand that 5.2 will have a modified API to support the triangle intersection hardware, but Prime never needed this: it is already triangle-only.
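
    Concretely, the pattern I mean looks roughly like this (a sketch from memory of the Prime 5.x C API; treat the exact enum and function names as assumptions, and note that all error checking is omitted):
    [code]
    // Trace a wavefront of rays that already live in device memory. The rays in
    // d_rays were written by a CUDA kernel; the hits land in d_hits and are read
    // by the next CUDA (shading) kernel. Nothing is copied to or from the host.
    #include <optix_prime/optix_prime.h>

    void traceWavefront(RTPcontext context, RTPmodel model,
                        void* d_rays, void* d_hits, size_t numRays)
    {
        RTPbufferdesc raysDesc, hitsDesc;
        rtpBufferDescCreate(context, RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX,
                            RTP_BUFFER_TYPE_CUDA_LINEAR, d_rays, &raysDesc);
        rtpBufferDescSetRange(raysDesc, 0, numRays);

        rtpBufferDescCreate(context, RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V,
                            RTP_BUFFER_TYPE_CUDA_LINEAR, d_hits, &hitsDesc);
        rtpBufferDescSetRange(hitsDesc, 0, numRays);

        RTPquery query;
        rtpQueryCreate(model, RTP_QUERY_TYPE_CLOSEST, &query);
        rtpQuerySetRays(query, raysDesc);
        rtpQuerySetHits(query, hitsDesc);
        rtpQueryExecute(query, 0);  // hits are now in d_hits for the shading kernel
        rtpQueryDestroy(query);
    }
    [/code]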

    Anyway, I suppose I was mostly venting my grievance with the lack of information for developers. I feel like I’m stumbling in the dark, and have been for months now.

    The buffers might not leave the GPU, but they still live in GDDR somewhere. So during trace, rays are being streamed in and out of (graphics) memory.

    The claim is that this ray traffic already consumes your memory bandwidth, so there’s no point in adding RTX to the mix for faster intersection. 10 GRay/s at say 40 B/Ray will eat 400 GB/s of perfectly streamed ray data (ref. 448 GB/s and 616 GB/s on the 2080 and 2080 Ti, respectively). Of course, the ray data won’t be perfectly streamed, and even if it were, there’s still the scene data to load (i.e. the BVH nodes for each ray).

    One can mess with the assumptions: 40 B/Ray could be compressed. Scene data and ray data can be streamed coherently and simultaneously (at an algorithmic cost). But, at the very least it seems clear that a lot of your bandwidth is eaten by ray data, so you won’t get max perf with OptiX Prime.
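
    If you want to play with those assumptions yourself, the arithmetic is just this (the figures are the assumed ones from above, not measurements):
    [code]
    // Back-of-the-envelope ray-traffic estimate.
    #include <cstdio>

    int main()
    {
        const double raysPerSec  = 10e9;  // claimed 10 GRay/s
        const double bytesPerRay = 40.0;  // assumed ray record size
        const double trafficGBs  = raysPerSec * bytesPerRay / 1e9;  // 400 GB/s
        std::printf("ray traffic: %.0f GB/s\n", trafficGBs);
        std::printf("share of 2080    (448 GB/s): %.0f%%\n", 100.0 * trafficGBs / 448.0);
        std::printf("share of 2080 Ti (616 GB/s): %.0f%%\n", 100.0 * trafficGBs / 616.0);
        return 0;
    }
    [/code]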

    By contrast, with OptiX, the rays can be processed as they are generated, so you needn’t go to memory (or at least not as much).

    Most of the cost of traversal is in memory stalls waiting for BVH nodes to load from (G)DDR. The speculation in our group is that they’re using a compressed BVH with treelets, but triangle costs are still going to be relatively minor.

    Yeah, I’m with you. If NVIDIA were more forthcoming with the details of this system, we wouldn’t have to speculate, and we would probably be able to write better, faster code.

    Honestly, I get the impression they’re not so forthcoming because they don’t know… this all seems to be very experimental to me right now.

    This is an interesting discussion and I’m sorry we are not able to share more information about the upcoming SDK publicly.

    I would recommend this talk for some information on the triangle API.
    http://on-demand.gputechconf.com/siggraph/2018/video/sig1812-2-oliver-klehm-high-performance-optix.html

    However, I will say that you are correct that it is not possible to reach 10 Grays/sec with OptiX Prime, so yes, to take advantage of the RT Cores you must use the OptiX SDK.

    The team has been working very hard on updating the OptiX SDK to fully take advantage of RTX and I’m really anxious to share it with you. Unfortunately we have to wait a little longer before it is ready for public posting.

    Thank you for your reply. I checked out the video you suggested, and that is indeed quite informative. Looking forward to optimizing for this platform. :)

    One more thing: NV highlighted that RTX can achieve “10+ Giga Rays/Sec using RT Cores”.

    The diagram clearly says that it is measured with primary rays only. So then, again, why is it that OptiX Prime can’t generate the holy 10+ Giga Rays, if the metric was given for eye rays only and no ray tree was used? I’m just curious. :-)

    @Ankit_Patel:

    I have some questions which I hope you will be at liberty to answer.

    The DAZ Studio application which I am using is built on the Iray SDK.

    The Iray SDK uses OptiX Prime.

    Given what you stated about OptiX Prime not being able to use the RT hardware, that means people using DAZ Studio won’t see any benefit from purchasing RTX-capable hardware.

    This is now a really confusing situation, so someone needs to clarify things for consumers, and I think only NVIDIA is in a position to issue such a clarification.

    My questions:

    1. After the new OptiX SDK with RTX support is released, will NVIDIA update the Iray SDK to use OptiX instead of OptiX Prime, or is that the vendor’s choice and/or responsibility?

    2. Is there any ETA for when the Iray SDK will be updated to take advantage of our $1,200 RTX 2080 Ti paperweights?

    Please advise.

    @dhart

    David, are you perhaps able to answer the two questions from my previous post here?

    I have been playing around with the new OptiX 6.0 today. Some findings:

    • OptiX Prime is apparently not deprecated after all… It is not only still available, it also got faster on RTX hardware (~20-40%) and slower on pre-RTX hardware (~5-10%). So, it appears to be using the hardware traversal units, despite the claims in this thread.

    • The new triangle geometry is quite easy to use, and very fast on RTX hardware (~560 Mrays/s including shading, which is probably about 1 Grays/s just for the ray tracing stages).

    • The documentation is not up-to-date, sadly. Functions are missing and/or have different names, macros apparently got renamed, function arguments changed. This happens for the new triangle geometry, so that’s quite unfortunate. Especially the default attribute program could use some explanation. Luckily there are header files, and one example.

    • It looks like having multiple entry points no longer works.

    Jacco.

    Hi Jacco,

    Thank you for the updates!

    So, OptiX Prime is still there for now, yes, but unfortunately it’s definitely not using the RTX hardware. There are a bunch of reasons that Prime in OptiX 6 can go faster on your system.

    Depending on your hardware, it should be easy to well exceed 1 Grays/s for traversal and triangle geometry. If you want to discuss the details of your code and get suggestions for optimizing, we’re happy to help. (I recommend starting a new topic or emailing optix-help.)

    We do still have some issues in the documentation, it’s true. Apologies! We’re trying to sort them out as quickly as possible. If you are willing to share any specific mistakes you ran into, I would definitely appreciate it and pass them on to the right people.

    Can you elaborate on your multiple entry points issue? Are you testing a sample or some other code? What are your symptoms (bad output, errors, crashing, etc.)? What hardware and OS are you testing with?


    David.

    I’ve had no problem using multiple entry points with OptiX 6.0.0 using the RTX execution strategy on GTX 1080 Ti or on RTX 8000.

    So David, if someone wanted to do wavefront path tracing with RT Core acceleration… could they do it? With any current or future API?