P2P not working for P600s?
Hi,

I have two K420s that I recently replaced with two P600s, but it appears that P2P is not working for the P600s.
However, it does work for K420s.

I was under the impression that P2P is supposed to work for identical cards, even GeForce cards. Has this policy changed?


Here is the output from simpleP2P from the NVIDIA samples:
[root@metty simpleP2P]# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 3
> GPU0 = "GeForce GTX 1050" IS capable of Peer-to-Peer (P2P)
> GPU1 = " Quadro P600" IS capable of Peer-to-Peer (P2P)
> GPU2 = " Quadro P600" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce GTX 1050 (GPU0) -> Quadro P600 (GPU1) : No
> Peer access from GeForce GTX 1050 (GPU0) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU1) -> GeForce GTX 1050 (GPU0) : No
> Peer access from Quadro P600 (GPU1) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU2) -> GeForce GTX 1050 (GPU0) : No
> Peer access from Quadro P600 (GPU2) -> Quadro P600 (GPU1) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.


And some nvidia-smi output:
[root@metty simpleP2P]# nvidia-smi
Tue Apr 3 13:57:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 Off | 00000000:05:00.0 Off | N/A |
| 35% 40C P0 N/A / 75W | 0MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P600 Off | 00000000:0B:00.0 Off | N/A |
| 36% 50C P0 N/A / N/A | 0MiB / 2000MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P600 Off | 00000000:0C:00.0 Off | N/A |
| 0% 67C P0 N/A / N/A | 0MiB / 2000MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

[root@metty simpleP2P]# nvidia-smi topo -m
GPU0 GPU1 GPU2 CPU Affinity
GPU0 X PHB PHB 0-5
GPU1 PHB X PIX 0-5
GPU2 PHB PIX X 0-5

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

[root@metty simpleP2P]# nvidia-smi topo -p2p w
GPU0 GPU1 GPU2
GPU0 X GNS GNS
GPU1 GNS X GNS
GPU2 GNS GNS X

Legend:

X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown





For the K420s, P2P works perfectly:
[root@metty p2pBandwidthLatencyTest]# nvidia-smi -L
GPU 0: GeForce GTX 1050 (UUID: GPU-578cae79-a799-351b-1b29-157171e6af0d)
GPU 1: Quadro K420 (UUID: GPU-30178a26-07b7-42a4-03bd-cf08253d89ae)
GPU 2: Quadro K420 (UUID: GPU-f81abec5-ef46-4ff7-4216-2d1786323335)

[root@metty p2pBandwidthLatencyTest]# nvidia-smi topo -m
GPU0 GPU1 GPU2 CPU Affinity
GPU0 X PHB PHB 0-5
GPU1 PHB X PIX 0-5
GPU2 PHB PIX X 0-5

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

[root@metty p2pBandwidthLatencyTest]# nvidia-smi topo -p2p rw
GPU0 GPU1 GPU2
GPU0 X NS NS
GPU1 NS X OK
GPU2 NS OK X

Legend:

X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
[root@metty p2pBandwidthLatencyTest]# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1050, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro K420, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device: 2, Quadro K420, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CANNOT Access Peer Device=0
Device=2 CAN Access Peer Device=1
...


I'm using Linux kernel 4.15, NVIDIA driver 390.30, and CUDA 9.1, in case that is relevant.
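
For what it's worth, my understanding is that the capability check the sample performs boils down to a single runtime call. Here is a minimal sketch of that check (the device ordinals 1 and 2 are assumptions matching the two P600s in my box):

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: ask the CUDA runtime whether one device can access a
// peer device's memory. Error handling kept to the essentials.
int main() {
    int canAccess = 0;
    cudaError_t err = cudaDeviceCanAccessPeer(&canAccess, 1, 2);  // GPU1 -> GPU2
    if (err != cudaSuccess) {
        printf("cudaDeviceCanAccessPeer failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("GPU1 -> GPU2 peer access: %s\n", canAccess ? "Yes" : "No");
    return 0;
}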


EDIT: Just out of curiosity, I tried with two K420s and one P600.
[root@metty simpleP2P]# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 3
> GPU0 = " Quadro P600" IS capable of Peer-to-Peer (P2P)
> GPU1 = " Quadro K420" IS capable of Peer-to-Peer (P2P)
> GPU2 = " Quadro K420" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU1) : No
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU2) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU1) -> Quadro K420 (GPU2) : Yes
> Peer access from Quadro K420 (GPU2) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU2) -> Quadro K420 (GPU1) : Yes
Enabling peer access between GPU1 and GPU2...
Checking GPU1 and GPU2 for UVA capabilities...
> Quadro K420 (GPU1) supports UVA: Yes
> Quadro K420 (GPU2) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU1, GPU2 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU2: 5.64GB/s
Preparing host buffer and memcpy to GPU1...
Run kernel on GPU2, taking source data from GPU1 and writing to GPU2...
Run kernel on GPU1, taking source data from GPU2 and writing to GPU1...
Copy data back to host from GPU1 and verify results...
Disabling peer access...
Shutting down...
Test passed
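
For reference, as far as I can tell from the sample source, the P2P path it exercises boils down to roughly the following (a sketch, not the sample verbatim; the 64 MB size matches the sample, and ordinals 1 and 2 are the two K420s here):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the peer-to-peer copy path: enable mutual peer access,
// then copy device-to-device. Error checks omitted for brevity.
int main() {
    const size_t bytes = 64 * 1024 * 1024;  // 64 MB buffers, as in the sample
    float *buf1 = NULL, *buf2 = NULL;

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(2, 0);  // let GPU1 map GPU2's memory
    cudaMalloc((void**)&buf1, bytes);

    cudaSetDevice(2);
    cudaDeviceEnablePeerAccess(1, 0);  // and the other direction
    cudaMalloc((void**)&buf2, bytes);

    // With peer access enabled, this copy can go directly over PCIe
    // between the two GPUs instead of staging through host memory.
    cudaMemcpyPeer(buf2, 2, buf1, 1, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf2);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}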

#1
Posted 04/03/2018 12:14 PM   
What does the deviceQuery sample report for your P600?

It seems that the simpleP2P test program was unable to detect CUDA compute capability 2.0 or higher. So maybe there is something wrong with the P600's CUDA support in this driver?

#2
Posted 04/03/2018 02:35 PM   
I've copied the output for only one P600, but the other one is identical:

Device 0: "Quadro P600"
CUDA Driver Version / Runtime Version 9.1 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 2000 MBytes (2097479680 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1557 MHz (1.56 GHz)
Memory Clock rate: 2005 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 12 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

#3
Posted 04/03/2018 02:40 PM   
Looks nominal to me.

Your CUDA runtime version is 8.0. Have you considered upgrading to CUDA toolkit 9.1?
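
If in doubt, a quick sketch like this prints both numbers that deviceQuery reports on its "CUDA Driver Version / Runtime Version" line:

#include <cstdio>
#include <cuda_runtime.h>

// Prints the CUDA version the driver supports and the runtime version
// the binary was linked against (e.g. 9010 decodes as CUDA 9.1).
int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);
    cudaRuntimeGetVersion(&runtimeVer);
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 100) / 10,
           runtimeVer / 1000, (runtimeVer % 100) / 10);
    return 0;
}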

#4
Posted 04/03/2018 02:45 PM   
Sorry, my mistake. I have both versions installed.

Here's the full output using the correct version. I've moved the GPUs around a bit, which is why they have different BDFs (PCI bus/device/function addresses).
As you can see, the P600s still report that they can't access each other using P2P.

[root@metty deviceQuery]# ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: "Quadro P600"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 2000 MBytes (2097479680 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1557 MHz (1.56 GHz)
Memory Clock rate: 2005 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro K420"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2000 MBytes (2096693248 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Max Clock rate: 876 MHz (0.88 GHz)
Memory Clock rate: 891 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Quadro P600"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 2000 MBytes (2097479680 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1557 MHz (1.56 GHz)
Memory Clock rate: 2005 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 12 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Quadro P600 (GPU0) -> Quadro K420 (GPU1) : No
> Peer access from Quadro P600 (GPU0) -> Quadro P600 (GPU2) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU0) : No
> Peer access from Quadro K420 (GPU1) -> Quadro P600 (GPU2) : No
> Peer access from Quadro P600 (GPU2) -> Quadro P600 (GPU0) : No
> Peer access from Quadro P600 (GPU2) -> Quadro K420 (GPU1) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 3
Result = PASS

#5
Posted 04/03/2018 02:55 PM   
Let's wait for the experts to chime in...

#6
Posted 04/03/2018 03:38 PM   
Hi again,

Based on the output from nvidia-smi, it would appear that GNS means that P2P does not work for that GPU at all, while NS means the GPU itself supports P2P, just not with that particular peer.

[root@metty ~]# nvidia-smi topo -p2p w
GPU0 GPU1 GPU2 GPU3
GPU0 X OK NS NS
GPU1 OK X NS NS
GPU2 NS NS X GNS
GPU3 NS NS GNS X

Legend:

X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
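
To cross-check the same per-pair status from the runtime side, cudaDeviceGetP2PAttribute (available since CUDA 8.0) can be queried directly; a small sketch that just mirrors the matrix above:

#include <cstdio>
#include <cuda_runtime.h>

// For every ordered pair of devices, ask the runtime whether P2P
// access is supported, analogous to the nvidia-smi topo -p2p output.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int supported = 0;
            cudaDeviceGetP2PAttribute(&supported, cudaDevP2PAttrAccessSupported, src, dst);
            printf("GPU%d -> GPU%d : %s\n", src, dst, supported ? "OK" : "NS");
        }
    }
    return 0;
}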


According to the documentation for the NVIDIA samples [1], P2P should generally be expected to work between similar GPUs, but the phrasing is a bit unclear:

In general, P2P is supported between two same GPUs with some exceptions, such as some Tesla and Quadro GPUs.

Does that mean that
  • P2P (generally) works for similar GeForce GPUs, but maybe not for some Quadros and Teslas?
  • or, that P2P (generally) works for similar GPUs, and some Quadros and Teslas in addition support some dissimilar GPUs?

I looked around in the specs for the Pascal Quadros, and it appears that P2P may actually only be supported for higher-end Quadros:
  • The P4000 and "above" explicitly list GPUDirect as a feature [2].
  • The P600, however, does not list GPUDirect as one of its features [3].

I guess this means that there is no hope of getting P2P to work for the P600s, which I must admit is quite disappointing.

[1] http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-peer-to-peer-transfers-with-multi-gpu
[2] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/documents/Quadro-P4000-US-03Feb17.pdf
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/documents/Quadro-P600-US-03Feb17.pdf


Here is some additional info about the setup, for anyone stumbling across this thread:
[root@metty ~]# nvidia-smi
Thu Apr 5 12:12:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K420 Off | 00000000:05:00.0 Off | N/A |
| 25% 50C P0 N/A / N/A | 0MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K420 Off | 00000000:06:00.0 Off | N/A |
| 26% 52C P0 N/A / N/A | 0MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P600 Off | 00000000:09:00.0 Off | N/A |
| 34% 48C P0 N/A / N/A | 0MiB / 2000MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro P600 Off | 00000000:0A:00.0 Off | N/A |
| 0% 65C P0 N/A / N/A | 0MiB / 2000MiB | 2% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+


[root@metty ~]# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PIX PHB PHB 0-5
GPU1 PIX X PHB PHB 0-5
GPU2 PHB PHB X PIX 0-5
GPU3 PHB PHB PIX X 0-5

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks


EDIT: The kernel I'm running has some changes to the DMA API for PCIe peer-to-peer, so I also tried booting an older kernel (3.10.0), but the result was the same: the K420s are able to do P2P, while the P600s are not.

In addition, I tried two GTX 750s; they also report themselves as GNS.

#7
Posted 04/05/2018 10:41 AM
[...] it appears that P2P may actually only be supported for higher-end Quadros

As someone who has used a number of low-end Quadros from the Fermi through the Pascal generations, this strikes me as a correct assessment.

The unfortunate part is that NVIDIA has (to my knowledge) never provided a handy table showing which Quadro models their high-end features are limited to. One either has to dig through various online specifications, or find out by trying with the actual hardware, as you have done here.

The K40s you used previously are clearly highest-end GPUs from the Kepler family, so it is not surprising that these high-priced cards come with all the bells and whistles NVIDIA has to offer.

#8
Posted 04/05/2018 07:12 PM