OptiX abstracts multiple GPUs behind a single context, so the user-facing side of the API is unaware of them.
OptiX does the scheduling and provides the results in an output buffer.
That way you do not know which GPU wrote which part of the output buffer.
This works nicely with gather algorithms, where each launch index writes into a unique memory location.
With scatter algorithms, which write into the output buffer using atomicAdd, this won't work, because atomics only serialize accesses from threads on one GPU, not across multiple GPUs.
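To make the distinction concrete, here is an untested sketch of both access patterns as OptiX 5/6 style ray generation programs (buffer and program names are just examples):

```cpp
#include <optix_world.h>

rtDeclareVariable(uint, launch_index, rtLaunchIndex, );
rtDeclareVariable(uint, launch_dim, rtLaunchDim, );

rtBuffer<float, 1> output_buffer;

RT_PROGRAM void gather_style()
{
    // Gather: each launch index writes to its own unique cell.
    // This works transparently across multiple GPUs.
    output_buffer[launch_index] = 1.0f; // stand-in for the real result
}

RT_PROGRAM void scatter_style()
{
    // Scatter: many launch indices can hit the same cell.
    // atomicAdd serializes accesses only among threads of ONE GPU,
    // so this is broken in a multi-GPU context.
    const unsigned int target = (launch_index * 2654435761u) % launch_dim; // placeholder scatter target
    atomicAdd(&output_buffer[target], 1.0f);
}
```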
That’s why I said it’s more or less automatic; the “less” part is covered by the caveats I listed.
The main problem here is that the scheduling is done by OptiX. When running on multiple GPUs you have no control over which GPU handles which launch index.
That means you cannot simply gather the GPU-local buffers from multiple GPUs into a composited output buffer, because you do not know which GPU is going to work on what data! For example, you might fetch from the scratch buffer of one GPU although the result was actually written on another GPU, and you would miss the proper result.
Actually yes, if writes from multiple GPUs go to completely disjoint pinned memory areas, that might work.
There is a discussion about atomics on pinned memory here:
https://stackoverflow.com/questions/23193151/atomic-operations-in-cuda-kernels-on-mapped-pinned-host-memory-to-do-or-not-to
cudaGetDevice() is a CUDA runtime host function. You can’t call that in an OptiX kernel.
The crucial question remains how to identify the individual GPUs inside the OptiX device code. The only solution I can see in the OptiX documentation (quoted below) is to use a CUDA interop input buffer, for which the application must provide device pointers for all of the devices.
Gathering all relevant documentation:
3.4.2.2 enum RTbufferflag
RT_BUFFER_GPU_LOCAL An RT_BUFFER_INPUT_OUTPUT has separate copies on each device that are not synchronized.
3.8.3.17 RTresult RTAPI rtBufferCreate
The flag RT_BUFFER_GPU_LOCAL can only be used in combination with RT_BUFFER_INPUT_OUTPUT. RT_BUFFER_INPUT_OUTPUT and RT_BUFFER_GPU_LOCAL used together specify a buffer that allows the host to only write, and the device to read and write data. The written data will never be visible on the host side and will generally not be visible on other devices.
That means input_output buffers are possible per GPU, but you cannot reliably read them back to the host, neither by mapping them directly nor via a compositing step, because the scheduling is not under your control.
7.2.2. Restrictions
An application must retrieve or provide device pointers for either one or all of the devices used by a buffer’s OptiX context. Getting or setting pointers for any other number of devices is an error. Getting pointers for some devices and setting them for others on the same buffer is not allowed. Calling rtBufferMap or rtBufferMarkDirty on a buffer with pointers retrieved/set on all of multiple devices is not allowed. Calling rtBufferSetDevicePointer on output or input/output buffers is not allowed.
That means it’s possible to set a different input buffer pointer, and therefore different contents, per GPU. That’s what you need!
7.2.1. Buffer Synchronization
Multi-Pointer Synchronization
If OptiX is using multiple devices it performs no synchronization when an application retrieves/provides buffer pointers for all the devices. OptiX assumes that the application will manage the synchronization of the contents of a buffer’s device pointers.
7.2.3. Zero-copy pointers
With a multi-GPU OptiX context and output or input/output buffers, it is necessary to combine the outputs of each used device. Currently one way OptiX accomplishes this is by using CUDA zero-copy memory. Therefore rtBufferGetDevicePointer may return a pointer to zero-copy memory. Data written to the pointer will automatically be visible to other devices. Zero-copy memory may incur a performance penalty because accesses take place over the PCIe bus.
That means it’s not actually possible to have separate output buffers per GPU.
Again, according to the documentation above, the only way to distinguish the individual GPUs is to use CUDA interop: create an input buffer and call rtBufferSetDevicePointer with a different device pointer for each GPU in the context.
That means OptiX won’t do any automatic synchronization for that buffer, and you can store something GPU-specific behind each pointer. In your case that buffer would need to hold just a single value, an unsigned integer with a zero-based GPU ID.
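Here is an untested host-side sketch of what I mean. The function and variable names are my own; the API calls follow the OptiX 5/6 C API (error checking omitted, and note that rtBufferSetDevicePointer took a CUdeviceptr in older OptiX versions):

```cpp
#include <optix.h>
#include <cuda_runtime.h>

void setupGpuIdBuffer(RTcontext context, RTbuffer* gpuIdBuffer)
{
    unsigned int deviceCount = 0;
    rtContextGetDeviceCount(context, &deviceCount);

    int devices[16]; // assumes at most 16 devices in the context for this sketch
    rtContextGetDevices(context, devices); // OptiX device ordinals of the context

    rtBufferCreate(context, RT_BUFFER_INPUT, gpuIdBuffer);
    rtBufferSetFormat(*gpuIdBuffer, RT_FORMAT_UNSIGNED_INT);
    rtBufferSetSize1D(*gpuIdBuffer, 1);

    for (unsigned int i = 0; i < deviceCount; ++i)
    {
        // rtDeviceGetAttribute and rtBufferSetDevicePointer take OptiX device
        // ordinals, while cudaSetDevice takes the CUDA ordinal; this attribute
        // maps between the two.
        int cudaOrdinal = -1;
        rtDeviceGetAttribute(devices[i], RT_DEVICE_ATTRIBUTE_CUDA_DEVICE_ORDINAL,
                             sizeof(cudaOrdinal), &cudaOrdinal);

        cudaSetDevice(cudaOrdinal);

        unsigned int* d_gpuId = 0;
        cudaMalloc((void**)&d_gpuId, sizeof(unsigned int));
        cudaMemcpy(d_gpuId, &i, sizeof(unsigned int), cudaMemcpyHostToDevice);

        // Per 7.2.2 this must be done for ALL devices of the context.
        rtBufferSetDevicePointer(*gpuIdBuffer, devices[i], d_gpuId);
    }
    // Attach *gpuIdBuffer to a context variable so the device code can read
    // its own GPU ID from it.
}
```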
Your algorithm would then need to write into a buffer that is, for example, number-of-GPUs times bigger, with consistent addressing based on that GPU ID, so that each GPU accumulates at disjoint memory locations.
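The matching device side could look like this (again untested; the buffer, variable, and program names are just examples):

```cpp
#include <optix_world.h>

rtDeclareVariable(uint, launch_index, rtLaunchIndex, );
rtDeclareVariable(uint, num_elements, , ); // logical size of one output slice

rtBuffer<unsigned int, 1> gpu_id_buffer; // the interop input buffer from above
rtBuffer<float, 1> accum_buffer;         // size = num_gpus * num_elements

RT_PROGRAM void scatter_accumulate()
{
    const unsigned int gpuId = gpu_id_buffer[0]; // different value on each GPU

    // Placeholder scatter target and contribution; your algorithm computes these.
    const unsigned int target = (launch_index * 2654435761u) % num_elements;
    const float value = 1.0f;

    // Each GPU only ever touches its own slice
    // [gpuId * num_elements, (gpuId + 1) * num_elements), so the atomicAdd
    // only needs to be coherent within one GPU, which is guaranteed.
    atomicAdd(&accum_buffer[gpuId * num_elements + target], value);
}
```

After the launch, the host would map accum_buffer and sum element j across all num_gpus slices to composite the final result.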
I’d be interested if that mechanism works.