Is it possible to use NCCL2 for, e.g., allreduce across multiple nodes over TCP/IP, without using MPI?
I’ve seen this capability mentioned, but I can’t find any way to specify the addresses of the other nodes in the NCCL docs, and the only examples (and Horovod) seem to use MPI.
For our application we need to set up an NCCL communicator across multiple processes on separate EC2 machines, but we are not using MPI.
Hm, how does NCCL2 figure out the IP addresses of the other nodes? I am interested in using NCCL2 for cross-machine allreduce. The handle is just an opaque identifier, right?
I have a similar problem as above: I do not want to use MPI to broadcast the ncclUniqueId. My own situation/context is that I am working on multi-node, multi-GPU deep learning, using NCCL2 to all-reduce the gradients without MPI.
I have three questions:
1. Is the best (or most convenient) way to broadcast the ncclUniqueId to use a UDP socket?
2. For multi-node NCCL, is it correct that we cannot use ncclCommInitAll instead of ncclCommInitRank?
3. Instead of broadcasting the ncclUniqueId, can we initialize all the communicators on one node and then send them to the different nodes?
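On question 1: the ncclUniqueId is an opaque, fixed-size byte blob (128 bytes, `NCCL_UNIQUE_ID_BYTES` in `nccl.h`), so any reliable transport works for shipping it from rank 0 to the other ranks; TCP is arguably simpler than UDP because the id must arrive complete and intact. Below is a minimal sketch of the rendezvous pattern in Python, standing in for what `MPI_Bcast` would otherwise do. The id bytes here are random placeholders for a real `ncclGetUniqueId` result, and the host/port are hypothetical; real code would follow the exchange with `ncclCommInitRank` on each rank.

```python
import os
import socket
import threading
import time

NCCL_UNIQUE_ID_BYTES = 128  # sizeof(ncclUniqueId) in nccl.h

def serve_unique_id(unique_id: bytes, host: str, port: int, nranks: int) -> None:
    """Rank 0: hand the opaque id to every other rank over TCP."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(nranks - 1)
        for _ in range(nranks - 1):
            conn, _addr = srv.accept()
            with conn:
                conn.sendall(unique_id)

def fetch_unique_id(host: str, port: int, retries: int = 50) -> bytes:
    """Ranks 1..n-1: connect to rank 0 (retrying while it starts) and read the id."""
    for _ in range(retries):
        try:
            conn = socket.create_connection((host, port))
            break
        except OSError:
            time.sleep(0.1)  # rank 0 may not be listening yet
    else:
        raise ConnectionError("could not reach rank 0")
    with conn:
        buf = b""
        while len(buf) < NCCL_UNIQUE_ID_BYTES:
            chunk = conn.recv(NCCL_UNIQUE_ID_BYTES - len(buf))
            if not chunk:
                raise ConnectionError("rank 0 closed before sending the full id")
            buf += chunk
        return buf

if __name__ == "__main__":
    # Stand-in for ncclGetUniqueId(&id) on rank 0; real code would call NCCL here.
    unique_id = os.urandom(NCCL_UNIQUE_ID_BYTES)
    host, port, nranks = "127.0.0.1", 50051, 3  # hypothetical rendezvous address
    t = threading.Thread(target=serve_unique_id, args=(unique_id, host, port, nranks))
    t.start()
    received = [fetch_unique_id(host, port) for _ in range(nranks - 1)]
    t.join()
    assert all(r == unique_id for r in received)
    # Each rank would now call ncclCommInitRank(&comm, nranks, id, rank).
```

On questions 2 and 3, as I understand the API: `ncclCommInitAll` creates all the communicators within a single process (one per local GPU), so it cannot span machines; multi-node setups need `ncclCommInitRank`. And an `ncclComm_t` is a live, process-local handle holding pointers and GPU state, so it cannot be serialized and sent to another node; only the ncclUniqueId is meant to travel.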