Acceleration getData and setData

orenouard · July 7, 2017, 8:13pm

Hi,

I see that the Acceleration getData and setData methods are being deprecated. I’m trying to serialize an acceleration structure so I can save and re-use it. I’m working with very large static data sets and might need to load only parts of a set as well, so I’ll probably end up with some hierarchical file format where I can request only a region of the dataset. I was hoping to avoid replicating what is basically an equivalent structure. Is there any other way to query the acceleration tree nodes, or any plan to bring this possibility back?

Olivier

orenouard · July 8, 2017, 3:07pm

To add to that: not being able to query even the top-level bounds is strange. The model loading examples resort to storing a separate bbox min and max with each loaded model just to be able to compute scene bounds, even though an accelerator has this information once built.
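
For reference, the separate bbox bookkeeping mentioned above amounts to something like the following. This is a minimal host-side sketch, not OptiX API: `Float3`, `Aabb` and `computeSceneBounds` are hypothetical names, and it assumes the vertex positions are still available as a flat array on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical minimal vector type (not the OptiX float3).
struct Float3 { float x, y, z; };

// Axis-aligned bounding box that starts out "inverted" (empty).
struct Aabb {
    Float3 bmin{ std::numeric_limits<float>::max(),
                 std::numeric_limits<float>::max(),
                 std::numeric_limits<float>::max() };
    Float3 bmax{ -std::numeric_limits<float>::max(),
                 -std::numeric_limits<float>::max(),
                 -std::numeric_limits<float>::max() };

    // Extend the box so that it contains point p.
    void grow(const Float3& p) {
        bmin.x = std::min(bmin.x, p.x); bmax.x = std::max(bmax.x, p.x);
        bmin.y = std::min(bmin.y, p.y); bmax.y = std::max(bmax.y, p.y);
        bmin.z = std::min(bmin.z, p.z); bmax.z = std::max(bmax.z, p.z);
    }

    // False until at least one point has been added.
    bool valid() const {
        return bmin.x <= bmax.x && bmin.y <= bmax.y && bmin.z <= bmax.z;
    }
};

// Scene bounds computed on the host from all vertex positions.
Aabb computeSceneBounds(const std::vector<Float3>& vertices) {
    Aabb box;
    for (const Float3& p : vertices) {
        box.grow(p);
    }
    return box;
}
```

The result corresponds to the `bmin` / `bmax` pair that the loaders print, which is exactly the information one would like to read back from the built accelerator instead.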

dlacewell · July 10, 2017, 3:32pm

Hi Olivier. Being able to query the top bounding box through the API is a fair request; I’ll forward that along for consideration.

But we really don’t want to expose the internal data layout of the Bvh via getData/setData.

The goal is for the “Trbvh” builder to be fast enough in practice for any scene. How many primitives do you have, roughly, and what’s your current build time with Trbvh (and on what hardware)? Are you using triangles or custom primitives, e.g., spheres? If you can give me rough timings, I’ll tell you if they’re in the expected range.

orenouard · July 10, 2017, 9:01pm

Hi and thanks,

I’m using triangle meshes for now and Trbvh. Probably will try to support ply point clouds as voxels as well later on, but for now purely triangles, and quite uniformly sized usually.

Here are the device specs :

Enabled Device id: 0
Device: GeForce GTX 980 Ti, clock 1228000, compute capability [5, 2]
Memory: 5337261670 / 6442450944, max textures count: 1048576
Multiprocessor count: 22, threads per block: 1024
CUDA GPU Device 0:GeForce GTX 980 Ti cm 5.2

Times are not really an issue now, since the acceleration will only have to be rebuilt at each load of a new object anyway, so the file loading and parsing times will dwarf the Acceleration build times. Here are the timings for a model of roughly 2.5M triangles:

Frame 1 for entry 0 took: 2262.53 [msecs]
Frame 2 for entry 0 took: 0.00176 [msecs]

(As opposed to about 7000 ms for a plain BVH Accelerator)

So the time difference between the two frames should roughly account for the buffer copy from host to device and the building of the acceleration structure. Not really an issue when compared with file parsing times anyway, as you can see below:

filesize: 283338207
load time: 0.073019 [msecs]

# of threads = 8

total parsing time: 22908.5 ms
line detection : 6167.46 ms
alloc buf : 653.088 ms
parse : 12704.9 ms
merge : 3372.69 ms
construct : 2094.18 ms
Geometry triangles count = 2589196
upload to device time: 9779.82 [msecs]
bmin = -9.359957, -8.549820, 1.697836
bmax = 21.165928, 5.092374, 10.014181
Available main device memory: 4084815462

Memory, however, will be a problem. I’d like to be able to work on 20M triangles and up, which I can’t fit on the GPU. So I was planning to load them in host memory and process them in chunks on the GPU. However, the chunks need to be spatially coherent, so I’ll need a BVH on the host as well. I was hoping to save memory and ease spatially coherent swapping by holding a full model BVH in host memory, and uploading only a sub-tree of it to the device each time I swap chunks.

Maybe, without giving access to the innards of it, allowing the acceleration structure to be mapped to host memory and unmapped back to device could be a possibility?

So I guess I’d need to build a BVH myself and use the RTUtraversal_api instead?

Because right now the “built-in” solution would not be very optimal: build a BVH on the host, use it to select spatially coherent chunks of the desired size, upload one chunk to the device and ask for a Trbvh build on it each time (and then speed might become an issue again, since a brand new Trbvh will have to be built for each chunk?)

Keith_Morley · July 10, 2017, 10:00pm

Hello Olivier,

Unfortunately, what you are hoping to do is not possible even with the current API. The serialized data returned by rtAccelerationGetData is intended to be an opaque data blob. You would not be able to parse or deconstruct it. Even if you did reverse engineer the data structure, it would not be possible to feed only a portion of the structure back into OptiX.

Are you rendering each chunk many times or only a single time? Also, I suspect that the timing you give above (2262 ms for frame one) includes JIT kernel compilation, which might account for the majority of that time as opposed to acceleration building.

It sounds like your two major concerns are:

1. Having the ability to perform coarse spatial binning of your data to feed large chunks to OptiX to render in a multi-pass fashion
2. Doing this without introducing large memory overheads

Can you accomplish this with a simple coarse binning on the host, which would require very little memory overhead? After all, it seems that you would only need a couple of top levels of a balanced BVH to get down to reasonable chunk sizes (e.g., 1 million triangles per bin). Such a BVH (or octree or grid) would require VERY little memory (a few kB maybe) and at the end of the day, each triangle would only have to pass through a full Trbvh build once.
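
A coarse host-side binning of this kind could look roughly like the sketch below. It is only one possible approach, assuming a median split on triangle centroids along the longest axis of their bounds; `Centroid`, `Chunk`, `coarseBin` and the `maxPerChunk` parameter are hypothetical names, not part of OptiX.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One centroid per triangle (hypothetical host-side representation).
struct Centroid { float v[3]; };

// Half-open range [begin, end) into the reordered triangle index array.
struct Chunk { std::size_t begin, end; };

// Recursively median-split the index range until each piece is small enough.
static void splitRange(const std::vector<Centroid>& c,
                       std::vector<std::size_t>& idx,
                       std::size_t begin, std::size_t end,
                       std::size_t maxPerChunk,
                       std::vector<Chunk>& out)
{
    if (end - begin <= maxPerChunk) {
        if (end > begin) out.push_back(Chunk{begin, end});
        return;
    }

    // Bounds of the centroids in this range, used to pick the longest axis.
    float lo[3], hi[3];
    for (int a = 0; a < 3; ++a) lo[a] = hi[a] = c[idx[begin]].v[a];
    for (std::size_t i = begin + 1; i < end; ++i) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], c[idx[i]].v[a]);
            hi[a] = std::max(hi[a], c[idx[i]].v[a]);
        }
    }
    int axis = 0;
    for (int a = 1; a < 3; ++a) {
        if (hi[a] - lo[a] > hi[axis] - lo[axis]) axis = a;
    }

    // Partition around the median along that axis (no full sort needed).
    const std::size_t mid = begin + (end - begin) / 2;
    auto first = idx.begin() + static_cast<std::ptrdiff_t>(begin);
    auto nth   = idx.begin() + static_cast<std::ptrdiff_t>(mid);
    auto last  = idx.begin() + static_cast<std::ptrdiff_t>(end);
    std::nth_element(first, nth, last,
        [&c, axis](std::size_t l, std::size_t r) {
            return c[l].v[axis] < c[r].v[axis];
        });

    splitRange(c, idx, begin, mid, maxPerChunk, out);
    splitRange(c, idx, mid, end, maxPerChunk, out);
}

// Fills 'order' with a permutation of triangle indices and returns the
// spatially coherent chunks as ranges into that permutation.
std::vector<Chunk> coarseBin(const std::vector<Centroid>& centroids,
                             std::size_t maxPerChunk,
                             std::vector<std::size_t>& order)
{
    assert(maxPerChunk > 0);
    order.resize(centroids.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::vector<Chunk> chunks;
    splitRange(centroids, order, 0, order.size(), maxPerChunk, chunks);
    return chunks;
}
```

Each returned range selects the triangles to gather into one upload, so only the index permutation and a handful of ranges stay resident on the host on top of the mesh itself.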

One other thing to note is that 20mil triangles should be able to fit onto some of our cards with higher memory capacity if you have access to any.

Thanks,
Keith

orenouard · July 10, 2017, 11:16pm

Hi Keith and thanks,

OK, I’m still new to OptiX. So there is no hope of using a custom accelerator with the RTUtraversal_api, and each chunk will need a Trbvh rebuild on the device?

Yes, I’m just getting times on host side between launches.

And yes coarse binning on the host sounds good. Any way to avoid a full JIT recompile when you’re just changing the scene top object? I have a very simple scene structure, could actually update the top object geometry buffers in place if it helped.

Yes about the card, although by then I’d probably get myself more host RAM and a bigger dataset as well :D. And the idea is to have something that behaves as well as possible with the 9xx and 1xxx series of consumer cards. So far I’ve found that as soon as I let the card do memory paging to host I lose a lot of performance, as expected, so I’m checking whether doing several launches on a subset of the data works better. I probably need to take a closer look at the Prime API too.

Thanks!

droettger · July 11, 2017, 9:32am

With respect to the RTU traversal API, it’s obsolete, don’t use it.
https://devtalk.nvidia.com/default/topic/965791/?comment=4982469
Rather look at the OptiX 4.1.0 optixRaycasting example for a similar functionality.

Changing the contents of an OptiX Geometry buffer requires an acceleration structure rebuild. You would need to mark it dirty when changing the underlying information. The acceleration structure rebuild will happen during the next launch. There is no way around that.

If you have issues building the acceleration structure with the Trbvh builder due to VRAM limits, there is a chunking mechanism in the Trbvh builder which can reduce the maximum required memory during build time. Have a look into Table 4 Acceleration Structure Properties inside the OptiX Programming Manual for more information.

orenouard · July 11, 2017, 12:01pm

Thanks,

I guess any acceleration structure rebuild will imply a JIT recompile, since I think I read in one of the presentations that there are some acceleration structure dependent optimisations done during the JIT compilation?

Olivier

droettger · July 11, 2017, 12:26pm

The PTX kernel code should only be compiled the first time you launch each raygeneration entry point in OptiX 4.x, and recompiled only when you change the PTX code, which includes declaring new variables etc.

An exchange of geometry buffer data should normally only affect the acceleration structure, otherwise this demo wouldn’t have been possible because the water geometry changed completely each frame: NVIDIA Kepler real-time raytracing demo at GTC 2012 - The Verge - YouTube

Just give it a try.

orenouard · July 11, 2017, 2:15pm

Ok thanks!