Using two GTX 1080 Ti cards is much slower than using one with the progressive photon mapping (PPM) sample in the OptiX Advanced Samples. I would like to know how to benefit from multiple GPUs in such a use case and, more generally, in use cases that have multiple passes and are bandwidth hungry.
It seems a major bottleneck is the output buffers that multiple GPUs are writing into. What if we could keep the photon maps entirely local to each GPU? In the gather pass, each GPU would just read its own photon map. By running the kd-tree construction on each GPU independently, the whole photon map construction could be duplicated on each GPU to avoid writing over PCIe. I am not sure how OptiX 5.0 could do this right now; it could be achieved with two features:
1. An OptiX launch allows each GPU to write to its own local buffer instead of only the cooperative mode (same output).
2. An OptiX launch allows GPUs to read from their corresponding local buffers, for example via a variable like “rtBufferLocal<> photon_map”. When writing to the final output, the GPUs could still run in cooperative mode, but each would read its own photon_map from local memory. Of course, rtBufferLocal buffers would not be automatically synchronized between GPUs.
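To illustrate the proposal, a rough sketch of how those two features might look together. This is purely hypothetical pseudocode: rtBufferLocal does not exist in OptiX 5.0, it is the feature being requested here.

```
// HYPOTHETICAL API -- rtBufferLocal is the proposed feature, not real OptiX 5.0
rtBufferLocal<PhotonRecord, 1> photon_map;  // one unsynchronized copy per GPU
rtBuffer<float4, 2>            output;      // normal cooperative output buffer

// Trace pass:   each GPU fills its own photon_map copy (feature 1).
// KD-tree pass: each GPU builds a kd-tree over its local copy, duplicated work
//               but no PCIe traffic.
// Gather pass:  GPUs cooperate on "output" as usual, but each reads only its
//               own local photon_map (feature 2).
```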
Is it safe to change “gather_buffer” to RT_BUFFER_GPU_LOCAL?
Changing to RT_BUFFER_GPU_LOCAL improves the gather pass from 0.018 s to 0.008 s on my two-GPU setup. It seems to work, but I am not sure whether that is just luck. “gather_buffer” is used by different passes, so this can only be correct if OptiX assigns the same region of “gather_buffer” to each GPU on every launch.
In fact, any accumulation-like buffer could benefit in such a use case. Even if “writes from multiple devices are not coherent, as a separate copy of the buffer resides on each device”, as long as OptiX guarantees stable (non-random) access to a local buffer across different OptiX launches, we can avoid copying back to the host.
I don’t see a variable named “gather_buffer” in the OptiX Advanced Examples. Please be more specific.
Citing some information from the OptiX API Reference, as already listed in another thread on this forum:
[i]3.4.2.2 enum RTbufferflag RT_BUFFER_GPU_LOCAL An RT_BUFFER_INPUT_OUTPUT has separate copies on each device that are not synchronized.
3.8.3.17 RTresult RTAPI rtBufferCreate
The flag RT_BUFFER_GPU_LOCAL can only be used in combination with RT_BUFFER_INPUT_OUTPUT. RT_BUFFER_INPUT_OUTPUT and RT_BUFFER_GPU_LOCAL used together specify a buffer that allows the host to only write, and the device to read and write data.
The written data will never be visible on the host side and will generally not be visible on other devices.[/i]
The quoted passages together imply that no reliable accumulation over multiple launches is possible in a multi-GPU context with RT_BUFFER_GPU_LOCAL buffers, because you do not control the scheduling of launch indices per GPU.
(EDIT: I was wrong about that. Actually the OptiX multi-GPU load balancer is static! See post further down.)
Final accumulation needs to happen in real input_output buffers.
It is a bit sad to learn about this limitation on local buffers. I understand the benefits of dynamic GPU scheduling, but having to synchronize over PCIe because of it is quite painful. Maybe an option to hint the scheduling would help, for example letting users pin 40% of the indices to GPU 1, 60% to GPU 2, and so on.
Sorry about the “gather_buffer”; I forgot that I had moved radius2, photon_count and flux out of HitRecord into an accumulation buffer.
Yes, I know. We’ll keep this in mind.
I had some painful experiences with a supposedly homogeneous multi-GPU setup (dual Quadro K6000) in an older system. While the boards were identical, one was connected to a PCI-E 16x Gen2 slot and the other to a PCI-E 4x Gen1 slot. There was no chance of good scaling on that: the slow PCI-E slot choked it, yielding only a 10-15% improvement over using just one of the boards with a standard full-image path tracer.
After discussing this internally: using an RT_BUFFER_INPUT_OUTPUT buffer with RT_BUFFER_GPU_LOCAL for accumulation on multiple GPUs actually works!
While the work distribution of the load balancer is still abstracted internally (to allow various schemes to be implemented), it is static across multiple GPUs. That means identical launch dimensions access identical launch indices, so gathering algorithms will work.
You would just need to write the final accumulated result to another output buffer to make it accessible on the host. That step could also do the tonemapping and the conversion from float to unsigned byte formats (recommended: RGBA32F or RGBA16F to BGRA8) to reduce the PCI-E load even more.