Partial failure of peer access on 8-Volta-GPU instance (p3.16xlarge) on AWS -> huge slowdown
My code attempts to enable peer access by GPU 0 to the other 7 GPUs in the system.

The first 4 pass cudaDeviceCanAccessPeer, but the last 3 fail.

This causes the code to run much slower than it does on a 4 GPU instance.
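
The enable logic is essentially the check-then-enable loop sketched below (simplified, with illustrative names; not my exact code):

// Simplified sketch of the peer check/enable loop (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    cudaSetDevice(0);                                  // enable access from GPU 0's context
    for (int peer = 1; peer < nDevices; ++peer) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, peer);  // can GPU 0 reach this peer directly?
        printf("GPU 0 -> GPU %d: %s\n", peer, canAccess ? "peer capable" : "NOT peer capable");
        if (canAccess)
            cudaDeviceEnablePeerAccess(peer, 0);       // second argument is a reserved flags field, must be 0
    }
    return 0;
}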

When profiled, I get the message:
==8804== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I believe this is a DGX-1 Station, and I'm running Windows 2016 with the CUDA 9.1 SDK and the latest driver as of mid-January.

The profiler shows a kernel execution time of 1 ms when not accessing UVM, but over 900 ms when trying to enable UVM.

The same kernel takes 2ms on a 4 GPU (p3.8xlarge) instance (with UVM), processing twice as much data.

Does anyone have any idea why 3 of 7 GPUs fail the peer access test?

#1
Posted 02/06/2018 11:37 PM   
Which driver version exactly are you using?
Are all GPUs in TCC mode, as reported by nvidia-smi?

The machines are not DGX-1s (nor are they DGX Stations; there is no such thing as a "DGX-1 Station").

#2
Posted 02/07/2018 02:35 AM   
NVIDIA DGX-1 8X V100: https://www.nvidia.com/en-us/data-center/dgx-1/

AFAIK, the only other 8x Volta system with NVLink is IBM's PowerPC system.

All GPUs have the same driver, and at least one has been verified to be in TCC mode.

The driver is 388.19, and I just tried to upgrade it to 390.65, but the install failed.

#3
Posted 02/07/2018 04:19 AM   
We recommend only r390 or higher drivers with CUDA 9.1 on AWS p3 instances.

#4
Posted 02/07/2018 10:15 AM   
It seems that this is expected behavior.

The platform in question is not actually a DGX-1V but topology-wise it is similar.

That topology has 4 GPUs connected via PCIE to one CPU socket, and 4 GPUs connected via PCIE to the other CPU socket.

The NVLINK topology is the same hybrid cube mesh published for DGX-1V. This means that each GPU has 4 neighbors that it can access via a single hop over NVLINK (it cannot access all 7 GPUs in the hybrid cube mesh via a single hop - 3 of the remote neighbors would require 2 hops).

The P2P enablement system will first use the NVLINK peers (single-hop) for enablement. If those don't exist for the requested pairing, it will attempt to do so over PCIE, if the topology permits. (Currently, it is not possible to establish P2P peering over NVLINK if the GPUs in question are not connected via a single hop.)

However, this DGX-1V-like topology has the 3 remote neighbors attached to the "other" socket, meaning PCIE peering would have to travel over the QPI link between sockets. This is a slow path and generally not supported for PCIE P2P anyway.

So in a nutshell, you are witnessing expected behavior. That platform/topology will support a max of 4 peer connections.
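
If you want to see the limit directly, a quick sketch along these lines (illustrative; I haven't run it on that exact instance) will print the pairwise peer capability together with the P2P performance rank. On this hybrid cube mesh topology you should see exactly 4 accessible peers per GPU:

// Print the pairwise peer-access matrix and P2P performance rank (lower rank = better link).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;                    // skip self
            int access = 0, rank = -1;
            cudaDeviceCanAccessPeer(&access, i, j);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, i, j);
            printf("GPU %d -> GPU %d : access=%d rank=%d\n", i, j, access, rank);
        }
    }
    return 0;
}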

#5
Posted 02/07/2018 04:35 PM   
Are you saying that it is not possible with any current system to enable Peer-to-Peer over NVLink between more than 4 GPUs?

I find it odd that merely trying to enable and use UVM between GPUs would drastically slow down the execution (according to NVProf) of a kernel that is writing to GPU global memory on a single GPU.

#6
Posted 02/07/2018 05:02 PM   
> Are you saying that it is not possible with any current system to enable Peer-to-Peer over NVLink between more than 4 GPUs?

Not quite, but something like that. For this particular hybrid cube mesh topology, with 4 GPUs connected to one socket and 4 to the other, yes, my comments apply to that. Nowhere in my comments did I say "any current system". Different systems have different topologies. I haven't done a summary or survey across all "current systems".

However we can draw some logical upper bounds. P2P across NVLINK, today, requires a one-hop connection. The V100 Volta GPU has a complement of 6 NVLINK "bricks" or "links" that can be used to connect to other devices. Therefore, under current considerations/assumptions, I would not expect that anyone could design a system that allowed for more than 6 P2P connections over NVLINK, to a single GPU. 7 (or higher) should not be achievable.

OTOH, if the system design also had PCIE connections to all GPUs connected to the same PCIE fabric (presumably via some kind of switch tree), then P2P would be possible between any 2 GPUs, albeit not using NVLINK in all cases, as stated in your question. There do exist systems that have 8 GPUs on the same PCIE switch/fabric.

> I find it odd that merely trying to enable and use UVM between GPUs would drastically slow down the execution (according to NVProf) of a kernel that is writing to GPU global memory on a single GPU.

There is usually an explanation for everything. If you are enabling P2P connections but not making use of those GPUs in any way (which would be odd) then I would find that observation odd, as well. However presumably you are making use of more than 1 GPU. The way in which you are using multiple GPUs probably contains the answer to the conundrum.

#7
Posted 02/07/2018 05:17 PM   
txbob, thanks for your responses!

Each of the 8 GPUs is executing the same kernels, reading and writing to GPU global memory.

When they all finish, GPU 0 attempts to copy buffers from the 7 other GPUs into buffers on GPU 0, and then sum them all together.

The same configuration using 4 GPUs runs ~4x as fast as on a single GPU, but failing the peer enable on 8 GPUs apparently causes the kernel to slow from what should be 1 ms to over 900 ms.

It seems that trying and failing to enable peer access affects what should be independent GPU kernel execution.
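
To make the flow concrete, this is roughly what the code does (a stripped-down sketch; the real kernel, sizes, and names differ):

// Stripped-down sketch of the multi-GPU flow described above (illustrative names).
// Each GPU runs the same kernel on its own buffer; GPU 0 then gathers and sums the results.
#include <vector>
#include <cuda_runtime.h>

__global__ void workKernel(float *out, size_t n)       // stand-in for the real kernel
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(i);          // placeholder work
}

__global__ void accumulate(float *dst, const float *src, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

int main()
{
    const size_t N = 1 << 20;
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    std::vector<float *> buf(nGpus);
    for (int d = 0; d < nGpus; ++d) {                   // every GPU runs the same kernel
        cudaSetDevice(d);
        cudaMalloc((void **)&buf[d], N * sizeof(float));
        workKernel<<<(N + 255) / 256, 256>>>(buf[d], N);
    }
    for (int d = 0; d < nGpus; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }

    cudaSetDevice(0);                                   // GPU 0 gathers and sums
    float *staging = nullptr;
    cudaMalloc((void **)&staging, N * sizeof(float));
    for (int d = 1; d < nGpus; ++d) {
        // cudaMemcpyPeer works with or without peer access enabled;
        // without P2P it takes a slower path staged through the host.
        cudaMemcpyPeer(staging, 0, buf[d], d, N * sizeof(float));
        accumulate<<<(N + 255) / 256, 256>>>(buf[0], staging, N);
    }
    cudaDeviceSynchronize();
    return 0;
}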

#8
Posted 02/07/2018 06:25 PM   
Are you using unified memory/managed memory?

#9
Posted 02/07/2018 07:11 PM   
I've allocated all GPU global memory with cudaMallocManaged, except for inter-kernel global storage.

#10
Posted 02/07/2018 07:15 PM   
Then the answer is contained in the profiler message you already posted:

"When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory"

This is particularly true in a Windows regime under CUDA 9.0/9.1, where demand-paged managed memory is not available.

If that is the root of the issue, then I would expect that refactoring the device memory allocations used by the kernel from cudaMallocManaged to cudaMalloc should restore approximately "full speed" kernel operation, even in the 8-GPU case. This doesn't address any other aspects of your code that I am not aware of, nor am I addressing the functionality, behavior, or performance of the final reduction step where the results from the 8 GPUs are combined.
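
In other words, something along these lines for the allocations the kernel touches (a sketch with illustrative names, not your actual code):

// Sketch of the suggested refactor: plain device memory plus explicit copies,
// instead of a managed allocation that falls back to zero-copy without full P2P support.
#include <cuda_runtime.h>

// Before:
//   float *data;
//   cudaMallocManaged(&data, n * sizeof(float));
//   kernel<<<grid, block>>>(data, n);

// After:
void runOnDevice(int dev, const float *hostIn, float *hostOut, size_t n)
{
    size_t bytes = n * sizeof(float);
    cudaSetDevice(dev);

    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, bytes);                    // ordinary device allocation
    cudaMemcpy(d_data, hostIn, bytes, cudaMemcpyHostToDevice);

    // kernel<<<grid, block>>>(d_data, n);                  // same kernel as before

    cudaMemcpy(hostOut, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}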

#11
Posted 02/07/2018 07:18 PM   
OK, thanks txbob. The kernel is running 900x slower.

Apparently I just need to stop requesting unavailable peer mappings. But I will also stop using managed memory for inputs since that's not even necessary.

It seems odd that a failed request would cause a GPU-wide fallback, and that zero-copy memory is so drastically slower.

#12
Posted 02/07/2018 07:46 PM   
I removed all the failed peer mapping requests, and all the cudaMallocManaged calls, but kernel execution time is still as slow as before.

And when requesting peer mapping for GPU 0, it succeeds for GPUs 1-4, so does that mean there are 5 GPUs on that CPU node?

I was able to map 1-3 (or 4) to GPU 0 and 5-7 to GPU 4.

#13
Posted 02/07/2018 11:31 PM   
I won't be able to explain the performance of the application. I have no visibility into it.

Regarding the peer mapping, the hybrid cube mesh places 4 GPUs as one-hop NVLINK neighbors to a particular GPU. Not all of these are connected via PCIE to a particular CPU socket. It's probably best if you study a hybrid cube mesh diagram:

https://devblogs.nvidia.com/inside-pascal/

click this link below:
https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/8-GPU-hybrid-cube-mesh-624x424.png

Every green arrowhead on a particular GPU represents a connection to a one-hop neighbor. Notice that each GPU has 4 green arrowheads pointing to it.

#14
Posted 02/07/2018 11:36 PM   
I have run my code configured to use only 4 GPUs on the p3.16xlarge instance; the same configuration runs very fast on the p3.8xlarge instance. The result is the same glacial performance as before.

I have concluded that:

The problem is detailed in the profiler warning:

==7112== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

The AWS p3.16xlarge instance contains "devices without peer-to-peer support," which causes UVM to fall back to VERY slow zero-copy memory, which in turn causes my code to run VERY slowly, apparently because I am enabling peer-to-peer data links with managed memory allocations. The numerical results are the same as on the p3.8xlarge instance.

This does not occur on the AWS p3.8xlarge instance because it does not contain "devices without peer-to-peer support."

#15
Posted 02/08/2018 05:57 PM   