Performance of OptiX Prime

Hi,

I’m interested in exploring the viability of using OptiX for calculating view factors as part of an enclosure radiation simulation code. I adapted the primeSimplePP example from the SDK to take in a mesh file and use it to create an OptiX Prime model. Then I generate rays for each face and execute a closest-hit query. The results seem to be correct, but I’m concerned that I might not be getting the most out of my GPU.

In some recent presentations (http://i.imgur.com/kjqOy6k.png and http://imgur.com/Z5hRxe7), there are stats that indicate a GTX Titan (which is what I’m testing on) is capable of ~ 300M rays/sec, but my timings are coming in at about 50-60M rays/sec. I appreciate that marketing numbers are often “theoretical maxima”, but getting only ~20% of the performance that other people are claiming makes me think I might be approaching the problem in the wrong way!

Some additional information about the performance tests:

-using float3 for ray origins and directions, no double precision
-test meshes are tessellated spheres, looking similar to this: http://i.imgur.com/ioeJ6fR.png
-rays are generated to have some spatial locality (to try and minimize thread divergence when traversing BVH). Here’s a little animation indicating how the rays are numbered on one example triangle: http://i.imgur.com/Nk3anAz.gif
-timings are best-case, not counting any transfers, just the query execution:

//
// Execute query (only the query itself is timed, no transfers)
//
Query query = model->createQuery(RTP_QUERY_TYPE_CLOSEST);
query->setRays( rays.count(), Ray::format, rays.type(), rays.ptr() );
query->setHits( hits.count(), Hit::format, hits.type(), hits.ptr() );
cudaDeviceSynchronize();   // make sure all prior GPU work has finished before starting the timer
gettimeofday(&then, NULL);
query->execute( 0 );
query->finish();           // block until the query has completed
gettimeofday(&now, NULL);
double query_time = time_elapsed(then, now);

A sphere with 131072 triangles and 1024 rays / tri takes 2.30383 seconds on this machine, with CUDA 7.5, OptiX 3.9, driver version 352.93, on a GTX Titan.

I’ve read through chapters 9 and 11 of the OptiX programming guide to see if there are any obvious ways I might be slowing things down, but nothing leaps out at me. Can anyone offer some insight on where I might look for problems, and how to effectively profile an OptiX Prime application? I have experience with CUDA, but this is my first time working with OptiX.

Thanks

I would recommend working through all the other OptiX Prime examples as well.
The simplest one is most likely not optimized at all. For example, by default primeSimplePP runs on the CPU rather than the GPU so that it works everywhere (default RTP_CONTEXT_TYPE_CPU), and it puts the queries in host buffers (default RTP_BUFFER_TYPE_HOST) instead of device buffers. It then executes a single synchronous query.
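
To make that concrete, here is a rough sketch of the GPU path using the optix::prime C++ wrapper (optix_primepp.h). The Ray/Hit structs, the d_* pointers and the traceOnGpu name are only placeholders I made up for illustration, not code from the SDK sample:

// Rough sketch only: GPU context + device buffers, in the style of primeSimplePP.
#include <optix_prime/optix_primepp.h>
#include <cuda_runtime.h>

using namespace optix::prime;

struct Ray { float3 origin; float3 dir; };            // RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION layout
struct Hit { float t; int triId; float u; float v; }; // RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V layout

void traceOnGpu( const int3* d_indices, size_t numTris,
                 const float3* d_vertices, size_t numVerts,
                 Ray* d_rays, Hit* d_hits, size_t numRays )
{
  // Run queries on the GPU instead of the CPU default.
  Context context = Context::create( RTP_CONTEXT_TYPE_CUDA );

  // The geometry already lives in device memory, so describe it as CUDA_LINEAR buffers.
  BufferDesc indices = context->createBufferDesc( RTP_BUFFER_FORMAT_INDICES_INT3,
                                                  RTP_BUFFER_TYPE_CUDA_LINEAR, (void*)d_indices );
  indices->setRange( 0, numTris );
  BufferDesc vertices = context->createBufferDesc( RTP_BUFFER_FORMAT_VERTEX_FLOAT3,
                                                   RTP_BUFFER_TYPE_CUDA_LINEAR, (void*)d_vertices );
  vertices->setRange( 0, numVerts );

  Model model = context->createModel();
  model->setTriangles( indices, vertices );
  model->update( 0 );

  // Rays and hits in device buffers: the query traces them in place instead of
  // copying to/from host memory first.
  Query query = model->createQuery( RTP_QUERY_TYPE_CLOSEST );
  query->setRays( numRays, RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION,
                  RTP_BUFFER_TYPE_CUDA_LINEAR, d_rays );
  query->setHits( numRays, RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V,
                  RTP_BUFFER_TYPE_CUDA_LINEAR, d_hits );
  query->execute( 0 );
}

The two things that matter here are RTP_CONTEXT_TYPE_CUDA for the context and RTP_BUFFER_TYPE_CUDA_LINEAR for the geometry, ray and hit buffers.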

The results in the slides are from renderers which accumulate many frames progressively, with fully asynchronous queries and multiple queries in flight. It also depends on the number of rays per query: neither ridiculously small nor ridiculously large batches will give the best performance. Keep it well above 64k and up to a few million rays per query for the best GPU load.

How do you structure your queries?
Is that a single query with 128k * 1024 directions?

Or 1024 queries with 128k rays each?
The latter would allow you to influence the overall ray directions in each query for the best convergence.
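
To sketch what "multiple queries in flight" could look like (this continues the placeholders from the snippet above, with totalRays rays already generated into d_rays; the batch bookkeeping is only illustrative):

#include <algorithm>   // std::min

// Split the rays into batches and keep two asynchronous queries in flight,
// so one batch can be set up while the other is still tracing.
const size_t raysPerBatch = 1 << 20;   // on the order of a million rays per query
const size_t numBatches   = ( totalRays + raysPerBatch - 1 ) / raysPerBatch;

Query queries[2] = { model->createQuery( RTP_QUERY_TYPE_CLOSEST ),
                     model->createQuery( RTP_QUERY_TYPE_CLOSEST ) };

for( size_t b = 0; b < numBatches; ++b )
{
  Query& q = queries[b % 2];
  if( b >= 2 )
    q->finish();                       // wait for the batch that used this slot last

  const size_t offset = b * raysPerBatch;
  const size_t count  = std::min( raysPerBatch, totalRays - offset );

  q->setRays( count, RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION,
              RTP_BUFFER_TYPE_CUDA_LINEAR, d_rays + offset );
  q->setHits( count, RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V,
              RTP_BUFFER_TYPE_CUDA_LINEAR, d_hits + offset );
  q->execute( RTP_QUERY_HINT_ASYNC );  // returns immediately
}

for( size_t i = 0; i < 2 && i < numBatches; ++i )
  queries[i]->finish();                // drain whatever is still in flight

Whether something like this beats one huge query depends on the batch size, which is why I’m asking how your queries are structured.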

When you say this runs for more than 2 seconds, what’s your OS version?
If you’re on Windows, GPU kernel launches must not run longer than about 2 seconds or you’ll hit a timeout (TDR).

It’s not clear what you’re actually intersecting. You say you have a single sphere and shoot 1024 rays per triangle, and the animated GIF looks like a cosine weighted hemispherical distribution of ray directions. But what is hit when doing that on the outside of a sphere? Are all results a miss?

I can certainly do that. I admit, I left the geometry buffers and ray buffers in host memory, thinking that they would just be copied when building the model and query (which is indeed slow, but was also not being timed). Upon moving everything over to CUDA_LINEAR buffers, performance improved by about 15% to ~65M rays/sec, but this is still less than I was hoping for.

I am doing a single query. What do you mean by “influence the overall ray directions in each query for best convergence”? So far, my tests show that smaller queries achieve lower throughput than larger ones:

131072 tris with 64 rays / tri: 51,000,000 rays/sec
131072 tris with 1024 rays / tri: 66,000,000 rays/sec

Should I expect performance to improve if I break my query up into a series of smaller ones? If so, do I create a separate Query object for each one, or can I reuse the same one:

Query query = model->createQuery( ... );

for each batch {
  query->setRays( ... );
  query->setHits( ... );
  query->execute( 0 );
}

query->finish();

The sphere is the enclosure, so the normals are actually taken to be inward-pointing. Each triangle has the hemispherical ray distribution at its centroid (and possibly additional sample points), and those rays are intersected with the rest of the enclosure. That information is used to approximate the view factor integration (i.e. each ray’s payload is a fraction of the side’s radiated energy). It’s like a hemicube for global illumination, but in grayscale.
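
Schematically, the sampling and bookkeeping look something like this (simplified; the tangent frame and RNG are left out, and the function name is just a placeholder, not my exact code):

#include <cmath>
#include <cuda_runtime.h>   // float3, make_float3

// Cosine-weighted direction in the hemisphere around the inward face normal n;
// t and b are any two tangent vectors completing an orthonormal frame.
// With this distribution, the view factor F(i->j) is approximated by the
// fraction of rays leaving face i whose closest hit lands on face j.
inline float3 cosineSampleHemisphere( float u1, float u2,
                                      const float3& n, const float3& t, const float3& b )
{
  const float r   = sqrtf( u1 );
  const float phi = 2.0f * 3.14159265f * u2;
  const float x   = r * cosf( phi );
  const float y   = r * sinf( phi );
  const float z   = sqrtf( 1.0f - u1 );   // cos(theta), so directions are weighted by cos(theta)
  return make_float3( x * t.x + y * b.x + z * n.x,
                      x * t.y + y * b.y + z * n.y,
                      x * t.z + y * b.z + z * n.z );
}

// After the closest-hit query, hits[k].triId says which face ray k landed on,
// so row i of the view factor matrix is just a normalized histogram:
//   F[i][j] ~ (# rays from face i hitting face j) / raysPerFace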

I’m on Ubuntu 14.04, and I haven’t had any problems with timeouts.

Let me chime in with one other idea: would you mind instrumenting the query in primeSimplePP exactly as you showed above, with the cudaDeviceSynchronize(), and posting the results? This would be a useful baseline.

Important: make sure you pass the “-c cuda” flag to run on the device. Also set “-w” to something like 2048 to shoot a bigger batch of rays.

Some results on a Quadro K5000 I happen to have plugged in right now:

> ./bin/primeSimplePP -c cuda -w 2048
15.4891 ms.
number of rays: 2533376
Mrays per second: 155.982

Here’s my output:

> ./primeSimplePP -w 4096 -c cuda
triangles: 5804
rays: 10133504
1.24424e+08 rays per second

For completeness, I’ll also include the results from my sphere refinement study:

tris   rays/tri  query time(s)     rays / sec
  2048,     64,      0.001244,      1.05363e+08
  2048,    256,      0.003621,      1.44791e+08
  2048,   1024,      0.010933,      1.91819e+08
  8192,     64,      0.005633,      9.30744e+07
  8192,    256,      0.019376,      1.08235e+08
  8192,   1024,      0.062970,      1.33216e+08
 32768,     64,      0.033286,      6.30040e+07
 32768,    256,      0.117562,      7.13548e+07
 32768,   1024,      0.347062,      9.66814e+07
131072,     64,      0.163259,      5.13822e+07
131072,    256,      0.602637,      5.56793e+07
131072,   1024,      2.018040,      6.65090e+07

The primeSimplePP example has a ray count and geometry complexity similar to my 8192 x 1024 case, and a similar rays/sec figure.

So, I guess this might just be what I should expect from my card. I don’t know under what conditions someone might achieve the quoted 300M rays/sec, except when intersecting very simple geometry!

Thanks for the input-- I’m still willing to try out any ideas to squeeze more performance out of this problem!

Wow, those numbers look too low. Much lower than our internal benchmarks on similar hardware, even for secondary rays. Is the ‘tris’ column also the total number of triangles in the scene that you’re intersecting, or are those only used to set the ray origins?

Would you mind sending us code so we can study this in detail? I’m going to PM you an email address.

One really quick thing to check: if you have multiple GPUs in the machine, make sure you set only the Titan to be visible, using CUDA_VISIBLE_DEVICES. Prime by default will split rays across GPUs to use the extra memory, but that’s not what you want here for peak performance.
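
For example, if the Titan is device 0 (“viewfactor” here just stands in for whatever your executable is called):

> CUDA_VISIBLE_DEVICES=0 ./viewfactor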

Doh, the primeSimplePP command I posted earlier still puts the ray data in host buffers by default. Here’s the version with device buffers on the K5000:

./bin/primeSimplePP -c cuda -b cuda -w 4096
number of rays: 10133504
number of triangles: 5804
time: 0.015691
Mrays/sec: 645.815

So we’re at 645 Mrays/sec now.

But here’s the same thing on an M6000, which should have roughly similar perf to your Titan on this example.

./bin/primeSimplePP -c cuda -b cuda -w 4096
number of rays: 10133504
number of triangles: 5804
time: 0.00682902
Mrays/sec: 1483.89

1.4 billion rays per second for primary rays. You won’t get that for secondary rays, but I would expect something in the 300M - 500M range.