Launch timeout with OptiX Prime 3.7 beta 3

Hello,

I’m trying to use OptiX Prime 3.7 beta 3 to replace my own OpenCL/CUDA ray tracer.
Unfortunately, when I launch my app, the display driver times out or my computer crashes with a blue screen (Windows 8.1).

The scene is small, with 926120 triangles and 468827 vertices. I create 2 queries of 1048576 rays (ORIGIN_TMIN_DIR_TMAX) and therefore 1048576 hits (T_TRIID_U_V).

Configuration: NVIDIA GeForce GTX TITAN, driver 347.52, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

Are there limitations on the number of queries, the query size, or the model size? Where can I find specifications for these limits?

I traced the execution with OPTIX_API_CAPTURE, but the calls seem to be well ordered.

4
64
Platform: Windows
Capture time: 2015-02-25 15:57
%%
rtpContextCreate( 257, 000000DBF11F84F0 )
  res = 0
  hdl = 000000DBF11F8670
rtpContextSetCudaDeviceNumbers( 000000DBF11F8670, 1, 000000DBAAC810C0 )
  val = 0
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1025, 513, 0000000904740000, 000000DBAAC810C0 )
  res = 0
  hdl = 000000DBF11F8E20
rtpBufferDescSetRange( 000000DBF11F8E20, 0, 926120 )
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1056, 513, 0000000905580000, 000000DBAAC9F1C0 )
  res = 0
  hdl = 000000DBF11F8990
rtpBufferDescSetRange( 000000DBF11F8990, 0, 468827 )
  res = 0
rtpBufferDescSetStride( 000000DBF11F8990, 32 )
  res = 0
rtpModelCreate( 000000DBF11F8670, 000000DBAAC94250 )
  res = 0
  hdl = 000000DBF11F8A10
rtpModelSetTriangles( 000000DBF11F8A10, 000000DBF11F8E20, 000000DBF11F8990 )
  file::prime::0000000904740000 = oac.prime.000000.potx // indices
  file::prime::0000000905580000 = oac.prime.000001.potx // vertices
  res = 0
rtpModelUpdate( 000000DBF11F8A10, 8193 )
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1089, 513, 00000009148C0000, 000000DBAABD3150 )
  res = 0
  hdl = 000000DBF078C9D0
rtpBufferDescSetRange( 000000DBF078C9D0, 0, 1048576 )
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1123, 513, 00000009168C0000, 000000DBAABC7B50 )
  res = 0
  hdl = 000000DBF078CA50
rtpBufferDescSetRange( 000000DBF078CA50, 0, 1048576 )
  res = 0
rtpModelFinish( 000000DBF11F8A10 )
  res = 0
rtpQueryCreate( 000000DBF11F8A10, 4097, 000000DBAABCB350 )
  res = 0
  hdl = 000000DBF078CAD0
rtpQuerySetRays( 000000DBF078CAD0, 000000DBF078C9D0 )
  file::prime::00000009148C0000 = oac.prime.000002.potx // rays_api
  res = 0
rtpQuerySetHits( 000000DBF078CAD0, 000000DBF078CA50 )
  file::prime::00000009168C0000 = oac.prime.000003.potx // hits_api
  res = 0
rtpQueryExecute( 000000DBF078CAD0, 16385 )
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1089, 513, 000000091A0C0000, 000000DBAA9E1870 )
  res = 0
  hdl = 000000DBAC3CFB70
rtpBufferDescSetRange( 000000DBAC3CFB70, 0, 1048576 )
  res = 0
rtpBufferDescCreate( 000000DBF11F8670, 1123, 513, 000000091C0C0000, 000000DBAA9B27C0 )
  res = 0
  hdl = 000000DBAC3CFBF0
rtpBufferDescSetRange( 000000DBAC3CFBF0, 0, 1048576 )
  res = 0
rtpModelFinish( 000000DBF11F8A10 )
  res = 0
rtpQueryCreate( 000000DBF11F8A10, 4096, 000000DBAA95ADE0 )
  res = 0
  hdl = 000000DBAC2BB2D0
rtpQuerySetRays( 000000DBAC2BB2D0, 000000DBAC3CFB70 )
  file::prime::000000091A0C0000 = oac.prime.000004.potx // rays_api
  res = 0
rtpQuerySetHits( 000000DBAC2BB2D0, 000000DBAC3CFBF0 )
  file::prime::000000091C0C0000 = oac.prime.000005.potx // hits_api
  res = 0
rtpQueryExecute( 000000DBAC2BB2D0, 16385 )
  res = 0
rtpQuerySetCudaStream( 000000DBF078CAD0, 000000DBB0840D30 )
  res = 0
rtpBufferDescSetRange( 000000DBF078C9D0, 0, 1048576 )
  res = 0
rtpBufferDescSetRange( 000000DBF078CA50, 0, 1048576 )
  res = 0
rtpQueryExecute( 000000DBF078CAD0, 16385 )
  res = 0

Thanks.

Hey z00,

just to get the ball rolling with your app, you could run an experiment and make your query execute calls synchronous (either by removing the async flag or by calling rtpQueryFinish after every execute call). Does that make your app work?
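A sketch of both variants against the OptiX Prime 3.7 C API (a fragment, not a complete program; `query` is assumed to be a valid, fully set-up RTPquery):

```c
#include <optix_prime/optix_prime.h>

/* Option 1: drop the async hint so rtpQueryExecute blocks until the query is done */
rtpQueryExecute(query, 0 /* instead of RTP_QUERY_HINT_ASYNC */);

/* Option 2: keep the async hint, but block right after each launch */
rtpQueryExecute(query, RTP_QUERY_HINT_ASYNC);
rtpQueryFinish(query);  /* blocks the calling thread until the query completes */
```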

Hello Heiko,

I tried removing the async calls from OptiX Prime and CUDA, dropping the async flags and streams, but I had the same issue. My app is also multi-threaded, but the documentation says that is supported.

  • Should I call all rtp functions from the same thread?

I changed the Windows hardware timeout in the registry editor; rtpQueryExecute then completed, but it took more than 20 seconds for just 1 million rays, so execution is very slow. It looks like memory swapping in the GPU process explorer.
I also tried changing the vertex and triangle formats, removing the strides, but it didn’t change the result.
My app needs a lot of device memory.
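(For reference, the timeout I changed is the Windows TDR delay; a .reg fragment like this raises it, with a hypothetical value of 60 seconds:)

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; 0x3c = 60 seconds before Windows resets the display driver
"TdrDelay"=dword:0000003c
```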

  • Is there a prerequisite on available memory before initialization of OptiX?

  • Should I create the OptiX context before my CUDA buffers (as for OpenGL interop)?

Thanks.

If I understand it right, you are using a single Prime context with multiple queries, and each query is executed in a separate thread, right?

I believe this is not supported at the moment. And yes, I believe you should call all rtp functions from the same thread (when you only use a single device, i.e. a single Prime context). When you want to use multiple devices (manually managed), then I guess you should create a separate Prime context (plus buffer descriptors, model, and query) for every device and set the device number of each context with rtpContextSetCudaDeviceNumbers. In that case you should be able to use multiple threads (one fixed thread for every Prime context, each using a single GPU).

So, to summarize, I believe this should work:

  • One Prime context with multiple async queries, all in the same thread
  • Multiple Prime contexts, each with one or more async queries, where every context is handled in its own thread (and the devices selected per context are mutually exclusive)

This is not supported (I think):

  • One Prime context with multiple async queries, each query handled in a different thread

There should not be a prerequisite on available (device) memory before initializing a context.
The order you use for creating OptiX contexts and CUDA buffers should not matter.
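The multi-context setup described above would look roughly like this (a fragment against the OptiX Prime C API; `NUM_DEVICES` is a placeholder, error handling is omitted, and the per-context worker thread is only indicated by a comment):

```c
#include <optix_prime/optix_prime.h>

/* One Prime context per GPU; each context is then used from its own fixed thread. */
RTPcontext contexts[NUM_DEVICES];
for (unsigned i = 0; i < NUM_DEVICES; ++i) {
    rtpContextCreate(RTP_CONTEXT_TYPE_CUDA, &contexts[i]);
    /* Bind exactly one CUDA device to this context, so the device sets
       of the contexts are mutually exclusive. */
    rtpContextSetCudaDeviceNumbers(contexts[i], 1, &i);
    /* Create the buffer descriptors, model, and queries on contexts[i],
       then hand contexts[i] over to its own worker thread. */
}
```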

I initialize a Prime context in one thread for all devices (in a loop), and I execute 2 queries per device in other, separate threads. So I tried putting all Prime calls in the same thread, but I had the same issue.

I tried playing with the number of rays per query and obtained these results:

Device GeForce GTX TITAN.
NVidia OptiX create model...0.058000 s
NVidia OptiX create first query...2.679000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 23831
NVidia OptiX create second query...0.000000 s
thread 0 number of rays in second query : 23831
thread 0 allocated memory : 76670756

//////////////////////////////////////////////

Device GeForce GTX TITAN.
NVidia OptiX create model...0.057000 s
NVidia OptiX create first query...0.364000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 47662
NVidia OptiX create second query...0.001000 s
thread 0 number of rays in second query : 47662
thread 0 allocated memory : 80102420

/////////////////////////////////////////////

Device GeForce GTX TITAN.
NVidia OptiX create model...0.055000 s
NVidia OptiX create first query...2.479000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 95325
NVidia OptiX create second query...0.002000 s
thread 0 number of rays in second query : 95325
thread 0 allocated memory : 86965892

///////////////////////////////////////////

Device GeForce GTX TITAN.
NVidia OptiX create model...0.059000 s
NVidia OptiX create first query...0.003000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 190650
NVidia OptiX create second query...0.000000 s
thread 0 number of rays in second query : 190650
thread 0 allocated memory : 100692692

///////////////////////////////////////////

Device GeForce GTX TITAN.
NVidia OptiX create model...0.057000 s
NVidia OptiX create first query...0.004000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 381300
NVidia OptiX create second query...0.002000 s
thread 0 number of rays in second query : 381300
thread 0 allocated memory : 128146292

///////////////////////////////////////////

Device GeForce GTX TITAN.
NVidia OptiX create model...0.061000 s
NVidia OptiX create first query...0.005000 s
thread 0 number of pixels : 1048576
thread 0 number of rays in first query : 762600
NVidia OptiX create second query...6.486000 s
thread 0 number of rays in second query : 762600
thread 0 allocated memory : 183053492

///////////////////////////////////////////

It is very strange that the number of rays per query yields seemingly random creation times. I thought it was a synchronization issue, but even with a single thread for all OptiX calls, this is the result.