Hi,
I think my PCIe topology may be incorrect: communication between the GPUs behaves strangely. Do you have any idea what could be wrong? Thank you.
Running "lspci -t" on the GPU machine gives the following:
| +-1f.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
| \-1f.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
+-[0000:80]-+-02.0-[81-84]----00.0-[82-84]--+-08.0-[83]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU2
| | \-10.0-[84]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU3
| +-03.0-[85-8e]----00.0-[86-8e]--+-08.0-[87-8a]----00.0-[88-8a]--+-08.0-[89]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU4
| | | \-10.0-[8a]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU5
| | \-10.0-[8b-8e]----00.0-[8c-8e]--+-08.0-[8d]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU6
| | \-10.0-[8e]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU7
| +-04.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 0
| +-04.1 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 1
| +-04.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 2
….
| +-1e.3 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
| +-1e.4 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Power Control Unit
| +-1f.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
| \-1f.2 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
\-[0000:00]-+-00.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
+-01.0-[01]--+-00.0 Intel Corporation Ethernet Controller 10-Gigabit X540-AT2
| \-00.1 Intel Corporation Ethernet Controller 10-Gigabit X540-AT2
+-02.0-[02-05]----00.0-[03-05]--+-08.0-[04]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU0
| \-10.0-[05]----00.0 NVIDIA Corporation GK210GL [Tesla K80] ---- GPU1
+-03.0-[06]--
+-04.0 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 0
+-04.1 Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMA Channel 1
The matrix of connection types between the GPUs is shown below. It seems that GPU0-GPU1 form one group and GPU2-GPU7 form a second group, and the two groups cannot communicate with each other correctly.
nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0  X     PIX   SOC   SOC   SOC   SOC   SOC   SOC   0-7,16-23
GPU1  PIX   X     SOC   SOC   SOC   SOC   SOC   SOC   0-7,16-23
GPU2  SOC   SOC   X     PIX   PHB   PHB   PHB   PHB   8-15,24-31
GPU3  SOC   SOC   PIX   X     PHB   PHB   PHB   PHB   8-15,24-31
GPU4  SOC   SOC   PHB   PHB   X     PIX   PXB   PXB   8-15,24-31
GPU5  SOC   SOC   PHB   PHB   PIX   X     PXB   PXB   8-15,24-31
GPU6  SOC   SOC   PHB   PHB   PXB   PXB   X     PIX   8-15,24-31
GPU7  SOC   SOC   PHB   PHB   PXB   PXB   PIX   X     8-15,24-31
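
For reference, in the nvidia-smi legend SOC means the path crosses the socket-level link between the CPUs (e.g. QPI), PHB means it crosses a PCIe host bridge, PXB means it crosses multiple PCIe switches, and PIX means it crosses a single PCIe switch. The grouping I describe above can be checked mechanically from the matrix. The following is only a rough Python sketch: the function names parse_topo and groups_without_soc are made up for illustration, and it assumes the matrix text has exactly the layout pasted above. It groups GPUs that can reach each other without going through a SOC link.

```python
# Sketch only: the helper names are hypothetical, and the input is assumed to
# look exactly like the "nvidia-smi topo -m" output pasted in this post.
TOPO = """\
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity
GPU0 X PIX SOC SOC SOC SOC SOC SOC 0-7,16-23
GPU1 PIX X SOC SOC SOC SOC SOC SOC 0-7,16-23
GPU2 SOC SOC X PIX PHB PHB PHB PHB 8-15,24-31
GPU3 SOC SOC PIX X PHB PHB PHB PHB 8-15,24-31
GPU4 SOC SOC PHB PHB X PIX PXB PXB 8-15,24-31
GPU5 SOC SOC PHB PHB PIX X PXB PXB 8-15,24-31
GPU6 SOC SOC PHB PHB PXB PXB X PIX 8-15,24-31
GPU7 SOC SOC PHB PHB PXB PXB PIX X 8-15,24-31
"""


def parse_topo(text):
    """Return (gpu_names, matrix) where matrix[a][b] is the link type a -> b."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Header columns that name GPUs; "CPU" and "Affinity" are skipped.
    names = [tok for tok in rows[0] if tok.startswith("GPU")]
    matrix = {}
    for row in rows[1:]:
        if row and row[0].startswith("GPU"):
            # The first len(names) fields after the row label are link types.
            matrix[row[0]] = dict(zip(names, row[1:1 + len(names)]))
    return names, matrix


def groups_without_soc(names, matrix):
    """Connected components when SOC links are treated as 'not connected'."""
    seen = set()
    groups = []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            cur = stack.pop()
            component.append(cur)
            for other, link in matrix[cur].items():
                if other not in seen and link not in ("X", "SOC"):
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    names, matrix = parse_topo(TOPO)
    for group in groups_without_soc(names, matrix):
        print(group)
```

On the matrix above this yields two groups, GPU0-GPU1 and GPU2-GPU7, which matches the split between the [0000:00] and [0000:80] root complexes in the lspci tree. It only summarises what nvidia-smi reports; it does not by itself show whether the topology is wrong.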