Memory usage in a multi-GPU system (NVLink) on Linux

Hi,
I’ve been testing OptiX 5.1.1 for 3D visualization for a few days. My code is based on the optixMeshViewer example, where I have replaced the existing OBJ loader with a proprietary loader. I am using the example’s PTX files and I provide all the buffers that the PTX code expects. My geometry is a classic triangle soup, and I provide the vertices and the indices. Since I can see that the code computes a geometric normal on the fly when the normal buffer is empty, I don’t provide the normal buffer.
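For reference, the buffer setup in my loader looks roughly like this (a minimal sketch using the optixpp C++ wrapper; the buffer variable names follow the optixMeshViewer PTX as far as I can tell, and the vertex/index containers stand in for my proprietary loader):
[code]
#include <optixu/optixpp_namespace.h>
#include <optixu/optixu_math_namespace.h>
#include <cstring>
#include <vector>

// Sketch: upload a triangle soup as read-only (RT_BUFFER_INPUT) buffers.
// 'vertices' and 'indices' stand in for the data from the proprietary loader.
optix::Geometry createMesh(optix::Context context,
                           const std::vector<float3>& vertices,
                           const std::vector<int3>&   indices)
{
  optix::Buffer vertexBuffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, vertices.size());
  memcpy(vertexBuffer->map(), vertices.data(), vertices.size() * sizeof(float3));
  vertexBuffer->unmap();

  optix::Buffer indexBuffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_INT3, indices.size());
  memcpy(indexBuffer->map(), indices.data(), indices.size() * sizeof(int3));
  indexBuffer->unmap();

  // Left empty: the example's PTX computes the geometric normal on the fly.
  optix::Buffer normalBuffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, 0);

  optix::Geometry geometry = context->createGeometry();
  geometry->setPrimitiveCount(static_cast<unsigned int>(indices.size()));
  geometry["vertex_buffer"]->setBuffer(vertexBuffer);
  geometry["index_buffer"]->setBuffer(indexBuffer);
  geometry["normal_buffer"]->setBuffer(normalBuffer);
  // Bounding box and intersection programs come from the example's PTX (omitted).
  return geometry;
}
[/code]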
I have access to a machine running Linux with 4 Tesla V100 GPUs, each with 32 GB of memory, connected through NVLink, and I tried to measure the memory usage on that machine. Since it’s running Linux, the driver should automatically run in TCC mode.
If I create the buffers with the RT_BUFFER_INPUT flag, with a geometry of 145,323,936 triangles and a “Bvh” acceleration structure, the memory usage is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   38C    P0    65W / 300W |  17366MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   40C    P0    68W / 300W |  14038MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   39C    P0    63W / 300W |  14038MiB / 32480MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0    69W / 300W |  14038MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

If I use RT_BUFFER_INPUT_OUTPUT instead, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   37C    P0    73W / 300W |  14042MiB / 32480MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   38C    P0    76W / 300W |  10714MiB / 32480MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    70W / 300W |  10714MiB / 32480MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0    75W / 300W |  10714MiB / 32480MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+

If I use only 1 GPU with RT_BUFFER_INPUT_OUTPUT, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |  17370MiB / 32480MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    45W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Is this the expected behavior? The amount of extra data I can fit by going from 1 GPU to 4 GPUs is only about 3 GB. Am I missing something?

That is to be expected, as long as you’re not even close to the 32 GB per board limit.

As long as there is no need to do peer-to-peer access over the NVLink bridge, OptiX will load the geometry onto all boards for better multi-GPU rendering performance.
Once it hits the memory limit, it will migrate buffers to individual boards and use peer-to-peer access.

This won’t work if all your geometry is in one big buffer, though! You’d need to split it up into smaller chunks of a few million primitives so that they can be migrated individually.

Please see this thread as well, especially the links in comments #2 and #4:
[url]https://devtalk.nvidia.com/default/topic/1027203/?comment=5226059[/url]
(I would not recommend using the progressive API, though. It’s faster to do manual accumulation.)

The difference between input and input_output buffers is that, in all shipping OptiX versions, input_output buffers are not allocated on the device. You should never need to allocate geometry attribute buffers as input_output buffers.
See more about that here: [url]https://devtalk.nvidia.com/default/topic/1036340/?comment=5264830[/url]
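In other words, something like this (a sketch only; the “output_buffer”/“accum_buffer” variable names and the GPU_LOCAL accumulation buffer are a common pattern, not taken from your code):
[code]
#include <optixu/optixpp_namespace.h>

// Sketch: only buffers actually written by device code need INPUT_OUTPUT.
void createFrameBuffers(optix::Context context, unsigned int width, unsigned int height)
{
  // Final image: written by the ray generation program, read back on the host.
  optix::Buffer outputBuffer = context->createBuffer(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT4, width, height);
  context["output_buffer"]->setBuffer(outputBuffer);

  // Accumulation scratch for manual accumulation: INPUT_OUTPUT plus GPU_LOCAL
  // keeps a separate copy per device and is never read on the host.
  optix::Buffer accumBuffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL, RT_FORMAT_FLOAT4, width, height);
  context["accum_buffer"]->setBuffer(accumBuffer);
}
[/code]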

Use the Trbvh acceleration structure builder, and if you’re only using triangles with float3 vertices and int3 indices, use the Acceleration properties to pick the specialized, faster builder implementation.
[url]https://devtalk.nvidia.com/default/topic/1022634/?comment=5211794[/url]
Example code for an interleaved vertex attribute format here:
Application.cpp in the nvpro-samples/optix_advanced_samples repository on GitHub.
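A minimal sketch of what that looks like (the “vertex_buffer”/“index_buffer” values are assumed to match the buffer variable names attached to your Geometry):
[code]
#include <optixu/optixpp_namespace.h>

// Sketch: a Trbvh Acceleration pointed at the triangle data so the
// specialized triangle builder can be used.
optix::Acceleration createTrbvhAcceleration(optix::Context context)
{
  optix::Acceleration acceleration = context->createAcceleration("Trbvh");
  // Values are the names of the buffer variables attached to the Geometry.
  acceleration->setProperty("vertex_buffer_name", "vertex_buffer");
  acceleration->setProperty("index_buffer_name",  "index_buffer");
  // Strides in bytes; 0 means tightly packed float3 vertices and int3 indices.
  // For an interleaved vertex layout, set the structure stride here instead.
  acceleration->setProperty("vertex_buffer_stride", "0");
  acceleration->setProperty("index_buffer_stride",  "0");
  return acceleration;
}
[/code]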

Thank you for your reply. I will implement your recommended changes.

Hi,
I made some changes to my test code: I split the vertex buffer into multiple ones (approximately 50 to 200, depending on the loaded mesh).
I also changed the CUDA code so that the vertex indices are now computed on the fly instead of being stored and copied from the CPU side, to save some memory on the GPU.
I ran some more tests, and it seems that the acceleration structures occupy far more memory than the vertex buffers in every case, so even if the vertex buffers are moved across GPUs (in the NVLink case), they are not the bulkiest objects in GPU memory.
Here is some data I collected from a data set with 13,464,000 triangles (split into 85 vertex buffers), using 4 GPUs:
NoAccel: 1013 MiB, 849 MiB, 849 MiB, 849 MiB
Bvh:     2125 MiB, 1815 MiB, 1815 MiB, 1815 MiB
Sbvh:    2725 MiB, 2415 MiB, 2415 MiB, 2415 MiB
Trbvh:   3667 MiB, 1815 MiB, 1815 MiB, 1815 MiB

The bigger case, with 145,323,936 triangles (about 200 vertex buffers):
NoAccel: 2903 MiB, 2739 MiB, 2739 MiB, 2739 MiB
Bvh:     16087 MiB, 12759 MiB, 12759 MiB, 12759 MiB
In this case both Sbvh and Trbvh run out of memory.

We would like to be able to predict whether a triangle mesh will fit into GPU memory, given the acceleration structure builder used (Bvh, Sbvh or Trbvh). Is there a rule of thumb?
The second question is: are the acceleration structures moved across GPUs (with NVLink)?
Thank you

Maybe I wasn’t clear enough about partitioning the geometry into smaller blocks.
That’s not about the Geometry nodes alone; it’s about the Acceleration objects on the GeometryGroups above the GeometryInstances. If you’re using a single GeometryGroup in your scene graph for the 145 MTriangles case, the acceleration structure stays the same whether you use a single Geometry node or multiple ones.
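In scene-graph terms it looks roughly like this (a sketch only; MeshChunk, createMesh and top_object are placeholder names following the SDK samples, not your code):
[code]
#include <optixu/optixpp_namespace.h>
#include <optixu/optixu_math_namespace.h>
#include <vector>

struct MeshChunk { std::vector<float3> vertices; std::vector<int3> indices; }; // placeholder
optix::Geometry createMesh(optix::Context context, const MeshChunk& chunk);    // hypothetical helper

// Sketch: one GeometryGroup + Acceleration per chunk of a few million
// triangles, gathered under a top-level Group with its own Acceleration,
// so each chunk's BVH can be placed and accessed peer-to-peer individually.
void buildSceneGraph(optix::Context context, optix::Material material,
                     const std::vector<MeshChunk>& chunks)
{
  optix::Group topGroup = context->createGroup();
  topGroup->setAcceleration(context->createAcceleration("Trbvh"));
  topGroup->setChildCount(static_cast<unsigned int>(chunks.size()));

  for (size_t i = 0; i < chunks.size(); ++i)
  {
    optix::GeometryInstance instance = context->createGeometryInstance();
    instance->setGeometry(createMesh(context, chunks[i]));
    instance->setMaterialCount(1);
    instance->setMaterial(0, material);

    optix::GeometryGroup geometryGroup = context->createGeometryGroup();
    geometryGroup->setChildCount(1);
    geometryGroup->setChild(0, instance);
    geometryGroup->setAcceleration(context->createAcceleration("Trbvh")); // one BVH per chunk
    topGroup->setChild(static_cast<unsigned int>(i), geometryGroup);
  }
  context["top_object"]->set(topGroup);
}
[/code]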

Assuming your out-of-memory errors are not on the host, you should be able to fit 145 MTriangles on a 32 GB board just fine, as the Bvh case shows.
The Trbvh builder in particular has a high temporary memory overhead during the build, though. See the Trbvh chunk_size acceleration property to overcome that:
[url]http://raytracing-docs.nvidia.com/optix/guide/index.html#host#acceleration-structure-properties[/url]
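For example (a sketch; the byte value is only an illustrative budget, check the linked guide for the exact semantics of the property):
[code]
optix::Acceleration acceleration = context->createAcceleration("Trbvh");
// Build the structure in partitions of roughly this many bytes to cap the
// temporary memory used during the Trbvh build (illustrative value).
acceleration->setProperty("chunk_size", "268435456"); // ~256 MB
[/code]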

But that shouldn’t be a problem if each Acceleration contains only a few million triangles, which can then be accessed via peer-to-peer. And yes, peer-to-peer applies to the acceleration structures, attribute buffers and textures.

Thank you for the clarification, I was able to visualize the 145MTriangles case using Trbvh. I was in fact using only one GeometryGroup.