*server name* [0] include/socket.h:185 WARN Call to connect failed : Connection refused
Failed, NCCL error nvidia-sample.cu:88 'unhandled system error'
This error occurs only when I try to create a communicator across nodes; when I create a communicator on a single node, the error does not occur.
My question is whether NCCL 2.0 supports inter-node communication over sockets, or only over InfiniBand.
I don’t have an InfiniBand environment, so I haven’t tested my program with InfiniBand, and I’m not sure whether my program is wrong or my test environment doesn’t support inter-node communication.
This error was caused by the NCCL2 environment settings.
NCCL2 was trying to use Docker's virtual network interface, which made it impossible for the nodes to communicate with each other.
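For anyone hitting the same thing, here is a minimal sketch of one way to steer NCCL away from a virtual interface. The post above does not say which setting was changed, so using the `NCCL_SOCKET_IFNAME` environment variable here is my assumption. The helper names (`pick_nccl_interfaces`, `configure_nccl_socket_ifname`) and the excluded prefixes `lo` and `docker` are hypothetical and only for illustration. The variable has to be set in every process on every node before the communicator is created.

```python
import os
import socket


def pick_nccl_interfaces(names, exclude_prefixes=("lo", "docker")):
    """Return the interface names that do not start with an excluded prefix.

    The default prefixes are an assumption: loopback and Docker bridge
    interfaces are the usual candidates that cannot reach other nodes.
    """
    return [n for n in names if not n.startswith(tuple(exclude_prefixes))]


def configure_nccl_socket_ifname(names=None):
    """Set NCCL_SOCKET_IFNAME to the remaining interfaces and return them.

    If no names are given, the interfaces of the local machine are listed
    with socket.if_nameindex() (Unix only).
    """
    if names is None:
        names = [name for _, name in socket.if_nameindex()]
    chosen = pick_nccl_interfaces(names)
    if chosen:
        # Comma-separated list of interface names for NCCL to consider.
        os.environ["NCCL_SOCKET_IFNAME"] = ",".join(chosen)
    return chosen


if __name__ == "__main__":
    # Call this before the framework initializes NCCL.
    print(configure_nccl_socket_ifname())
```

Exporting the variable in the shell or passing it to the container has the same effect. Doing it in Python only helps if it runs before the first NCCL call.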
This works! Thanks pakio!
I was running distributed TensorFlow with Horovod using NCCL 2 and Docker. I saw the same error message and solved it by following your method.
Really appreciate you sharing your experience!
I am getting the following error, “include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused”, while trying to run distributed PyTorch training across 3 nodes. What is the fix?
Here is the complete log:
distributed init (rank 20): tcp://172.31.10.218:9218
| distributed init (rank 18): tcp://172.31.10.218:9218
| distributed init (rank 16): tcp://172.31.10.218:9218
| distributed init (rank 22): tcp://172.31.10.218:9218
| distributed init (rank 23): tcp://172.31.10.218:9218
| distributed init (rank 17): tcp://172.31.10.218:9218
| distributed init (rank 19): tcp://172.31.10.218:9218
| distributed init (rank 21): tcp://172.31.10.218:9218
ip-172-31-9-87:17672:17672 [4] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17672:17672 [4] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17676:17676 [3] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17672:17672 [4] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17672:17672 [4] NCCL INFO rank 20 nranks 24
ip-172-31-9-87:17676:17676 [3] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17676:17676 [3] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17676:17676 [3] NCCL INFO rank 19 nranks 24
ip-172-31-9-87:17672:18605 [4] NCCL INFO comm 0x7f3cdc0551f0 rank 20 nranks 24
ip-172-31-9-87:17676:18606 [3] NCCL INFO comm 0x7fbd600551f0 rank 19 nranks 24
ip-172-31-9-87:17672:18605 [4] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17672:18605 [4] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17672:18605 [4] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17676:18606 [3] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17676:18606 [3] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17676:18606 [3] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17677:17677 [2] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17677:17677 [2] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17677:17677 [2] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17677:17677 [2] NCCL INFO rank 18 nranks 24
ip-172-31-9-87:17677:18607 [2] NCCL INFO comm 0x7fa2400551f0 rank 18 nranks 24
ip-172-31-9-87:17677:18607 [2] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17677:18607 [2] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17677:18607 [2] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17671:17671 [0] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17671:17671 [0] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17671:17671 [0] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17671:17671 [0] NCCL INFO rank 16 nranks 24
ip-172-31-9-87:17671:18608 [0] NCCL INFO comm 0x7fa3ac0551f0 rank 16 nranks 24
ip-172-31-9-87:17671:18608 [0] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17671:18608 [0] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17671:18608 [0] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17674:17674 [1] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17674:17674 [1] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17674:17674 [1] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17674:17674 [1] NCCL INFO rank 17 nranks 24
ip-172-31-9-87:17674:18611 [1] NCCL INFO comm 0x7f8fa80551f0 rank 17 nranks 24
ip-172-31-9-87:17674:18611 [1] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17674:18611 [1] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17674:18611 [1] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17678:17678 [6] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17678:17678 [6] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17678:17678 [6] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17678:17678 [6] NCCL INFO rank 22 nranks 24
ip-172-31-9-87:17678:18612 [6] NCCL INFO comm 0x7f25900551f0 rank 22 nranks 24
ip-172-31-9-87:17678:18612 [6] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17678:18612 [6] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17678:18612 [6] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17672:18605 [4] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17672:18605 [4] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17672:18605 [4] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17672:18605 [4] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17672:18605 [4] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17672:18605 [4] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17672:18605 [4] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
ip-172-31-9-87:17671:18608 [0] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17671:18608 [0] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17671:18608 [0] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17671:18608 [0] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17671:18608 [0] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17671:18608 [0] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17671:18608 [0] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
ip-172-31-9-87:17674:18611 [1] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17674:18611 [1] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17674:18611 [1] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17674:18611 [1] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17674:18611 [1] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17674:18611 [1] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17674:18611 [1] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
ip-172-31-9-87:17675:17675 [7] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17675:17675 [7] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17675:17675 [7] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17675:17675 [7] NCCL INFO rank 23 nranks 24
ip-172-31-9-87:17675:18614 [7] NCCL INFO comm 0x7fc7500551f0 rank 23 nranks 24
ip-172-31-9-87:17675:18614 [7] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17675:18614 [7] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17675:18614 [7] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17678:18612 [6] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17678:18612 [6] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17678:18612 [6] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17678:18612 [6] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17678:18612 [6] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17678:18612 [6] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17678:18612 [6] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
ip-172-31-9-87:17675:18614 [7] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17675:18614 [7] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17675:18614 [7] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17675:18614 [7] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17675:18614 [7] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17675:18614 [7] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17675:18614 [7] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
ip-172-31-9-87:17673:17673 [5] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17673:17673 [5] NCCL INFO NET/IB : Using interface lo for sideband communication
ip-172-31-9-87:17673:17673 [5] NCCL INFO Using internal Network Socket
ip-172-31-9-87:17673:17673 [5] NCCL INFO rank 21 nranks 24
ip-172-31-9-87:17673:18616 [5] NCCL INFO comm 0x7f84ec0551f0 rank 21 nranks 24
ip-172-31-9-87:17673:18616 [5] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17673:18616 [5] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
ip-172-31-9-87:17673:18616 [5] NCCL INFO NET/Socket : 2 interfaces found
ip-172-31-9-87:17673:18616 [5] include/socket.h:369 NCCL WARN Call to connect timeout : Connection refused
ip-172-31-9-87:17673:18616 [5] NCCL INFO transport/net_socket.cu:138 -> 2
ip-172-31-9-87:17673:18616 [5] NCCL INFO bootstrap.cu:19 -> 2
ip-172-31-9-87:17673:18616 [5] NCCL INFO bootstrap.cu:195 -> 2
ip-172-31-9-87:17673:18616 [5] NCCL INFO init.cu:446 -> 2
ip-172-31-9-87:17673:18616 [5] NCCL INFO init.cu:593 -> 2
ip-172-31-9-87:17673:18616 [5] NCCL INFO misc/group.cu:69 -> 2 [Async thread]
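
A log this long is easier to read once the "Using interface" lines are grouped. In the output above, the main thread of each process (the lines where the two numbers after the hostname are equal) reports only lo:127.0.0.1, while the later thread lists both lo and ens5. Whether that is the cause of the refused connection is not confirmed anywhere in this thread, so treat the following as a diagnostic sketch only. The helper names (`interfaces_by_thread`, `loopback_only`) are hypothetical, and the regular expression assumes the `host:pid:tid [device]` prefix seen in this particular log.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ip-172-31-9-87:17672:17672 [4] NCCL INFO NET : Using interface lo:127.0.0.1<0>
IFACE_RE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"NCCL INFO NET : Using interface (?P<iface>[^:\s]+):(?P<addr>[0-9.]+)"
)


def interfaces_by_thread(log_text):
    """Map (pid, tid) to the list of (interface, address) pairs it reported."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        m = IFACE_RE.match(line.strip())
        if m:
            found[(m["pid"], m["tid"])].append((m["iface"], m["addr"]))
    return dict(found)


def loopback_only(found):
    """Return the (pid, tid) keys that reported nothing but 127.x addresses."""
    return sorted(
        key
        for key, pairs in found.items()
        if all(addr.startswith("127.") for _, addr in pairs)
    )


if __name__ == "__main__":
    sample = """\
ip-172-31-9-87:17672:17672 [4] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17672:18605 [4] NCCL INFO NET : Using interface lo:127.0.0.1<0>
ip-172-31-9-87:17672:18605 [4] NCCL INFO NET : Using interface ens5:172.31.9.87<0>
"""
    print(loopback_only(interfaces_by_thread(sample)))
```

Running the same functions over the logs from all three nodes would show whether every node reports the same pattern, which is worth knowing before changing any settings.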