Partial failure of peer access on an 8-Volta-GPU instance (p3.16xlarge) on AWS -> huge slowdown

My code attempts to enable peer access from GPU 0 to the other 7 GPUs in the system.

The first 4 pass cudaDeviceCanAccessPeer, but the last 3 fail.
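For reference, the check-and-enable step is essentially the following (a trimmed-down sketch, not the exact code; error handling omitted):

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // GPU 0 probes each of the other GPUs and enables peer access where supported.
    cudaSetDevice(0);
    for (int peer = 1; peer < deviceCount; ++peer) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, peer);
        printf("GPU 0 -> GPU %d : peer access %s\n",
               peer, canAccess ? "supported" : "NOT supported");
        if (canAccess)
            cudaDeviceEnablePeerAccess(peer, 0);  // flags argument must be 0
    }
    return 0;
}
[/code]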

This causes the code to run much slower than it does on a 4 GPU instance.

When profiled, I get the message:
==8804== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation

I believe this is a DGX-1 Station, and I’m running Windows Server 2016 with the CUDA 9.1 SDK and the latest driver as of mid-January.

The profiler shows a kernel execution time of 1 ms when not accessing UVM, but over 900 ms when trying to enable UVM.

The same kernel takes 2ms on a 4 GPU (p3.8xlarge) instance (with UVM), processing twice as much data.

Does anyone have any idea why 3 of 7 GPUs fail the peer access test?

which driver version exactly are you using?
are all GPUs in TCC mode as reported by nvidia-smi?

The machines are not DGX-1 systems (nor are they DGX Stations; there is no such thing as a “DGX-1 Station”).

NVidia DGX-1 8X V100 Essential Instrument for AI Research | NVIDIA DGX-1

AFAIK, the only other 8x Volta system with NVLink is IBM’s PowerPC system.

All GPUs have the same driver, and at least one has been verified to be in TCC mode.

The driver is 388.19, and I just tried to upgrade it to 390.65, but the install failed.

we recommend only r390 or higher drivers with CUDA 9.1 on AWS p3 instances.

It seems that this is expected behavior.

The platform in question is not actually a DGX-1V but topology-wise it is similar.

That topology has 4 GPUs connected via PCIE to one CPU socket, and 4 GPUs connected via PCIE to the other CPU socket.

The NVLINK topology is the same hybrid cube mesh published for DGX-1V. This means that each GPU has 4 neighbors that it can access via a single hop over NVLINK (it cannot access all 7 GPUs in the hybrid cube mesh via a single hop - 3 of the remote neighbors would require 2 hops).

The P2P enablement system will first use the NVLINK peers (single-hop) for enablement. If those don’t exist for the requested pairing, it will attempt to do so over PCIE, if the topology permits. (Currently, it is not possible to establish P2P peering over NVLINK if the GPUs in question are not connected via a single hop.)

However, this DGX-1V-like topology has the 3 remote neighbors attached to the “other” socket, meaning PCIE peering would have to travel over the QPI link between sockets. That is a slow path and is generally not supported for PCIE P2P anyway.

So in a nutshell, you are witnessing expected behavior. That platform/topology will support a max of 4 peer connections.

Are you saying that it is not possible with any current system to enable Peer-to-Peer over NVLink between more than 4 GPUs?

I find it odd that merely trying to enable and use UVM between GPUs would drastically slow down the execution of a kernel (according to nvprof) that is only writing to GPU global memory on a single GPU.

Not quite, but something like that. For this particular hybrid cube mesh topology, with 4 GPUs connected to one socket and 4 to the other, yes, my comments apply to that. Nowhere in my comments did I say “any current system”. Different systems have different topologies. I haven’t done a summary or survey across all “current systems”.

However we can draw some logical upper bounds. P2P across NVLINK, today, requires a one-hop connection. The V100 Volta GPU has a complement of 6 NVLINK “bricks” or “links” that can be used to connect to other devices. Therefore, under current considerations/assumptions, I would not expect that anyone could design a system that allowed for more than 6 P2P connections over NVLINK, to a single GPU. 7 (or higher) should not be achievable.

OTOH, if the system design also had PCIE connections putting all GPUs on the same PCIE fabric (presumably via some kind of switch tree), then P2P would be possible between any 2 GPUs, albeit not over NVLINK in all cases, as stated in your question. There do exist systems that have 8 GPUs on the same PCIE switch/fabric.

There is usually an explanation for everything. If you are enabling P2P connections but not making use of those GPUs in any way (which would be odd) then I would find that observation odd, as well. However presumably you are making use of more than 1 GPU. The way in which you are using multiple GPUs probably contains the answer to the conundrum.

txbob, thanks for your responses!

Each of the 8 GPUs is executing the same kernels, reading and writing to GPU global memory.

When they all finish, GPU 0 attempts to copy buffers from the 7 other GPUs to buffers on GPU 0, and then sums them all together.

The same configuration using 4 GPUs runs ~4x as fast as on a single GPU, but failing the peer enable on 8 GPUs apparently causes the kernel to slow from what should be 1 ms to over 900 ms.

It seems that trying and failing to enable peer access does something to what should be independent kernel execution on each GPU.
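Roughly, the gather-and-sum step looks like this: a simplified sketch with a stand-in accumulation kernel (sumInto), not my actual code:

[code]
#include <cuda_runtime.h>

// Stand-in accumulation kernel: dst[i] += src[i].
__global__ void sumInto(float *dst, const float *src, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

void gatherAndSum(float *d_result0,   // result buffer resident on GPU 0
                  float **d_results,  // d_results[g] = result buffer on GPU g
                  float *d_staging0,  // scratch buffer resident on GPU 0
                  size_t n, int deviceCount)
{
    unsigned int blocks = (unsigned int)((n + 255) / 256);
    cudaSetDevice(0);
    for (int g = 1; g < deviceCount; ++g) {
        // Copy GPU g's buffer into the staging buffer on GPU 0...
        cudaMemcpyPeer(d_staging0, 0, d_results[g], g, n * sizeof(float));
        // ...then fold it into GPU 0's running total.
        sumInto<<<blocks, 256>>>(d_result0, d_staging0, n);
        cudaDeviceSynchronize();
    }
}
[/code]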

are you using unified memory/managed memory?

I’ve allocated all GPU global memory with cudaMallocManaged, except for inter-kernel global storage.

then the answer is contained in the profiler message you already posted:

“When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation”

This is particularly true on Windows under CUDA 9.0/9.1, where demand-paged managed memory is not available.

If that is the root of the issue, then I would expect that by refactoring to cudaMalloc, instead of cudaMallocManaged, for the device memory allocations used by the kernel, you should be able to restore approximately “full speed” kernel operation, even in the 8 GPU case. This doesn’t address any other considerations of your code that I am not aware of, nor am I addressing functionality, behavior, or performance, of the final reduction step where the results from 8 GPUs are combined.
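In sketch form the change is mechanical; something along these lines, with hypothetical buffer names (illustrative only, not your actual code):

[code]
#include <cuda_runtime.h>

// Replace a managed allocation with a plain device allocation plus explicit copies.
// h_in/h_out are hypothetical host buffers; bytes is the allocation size.
void runWithoutManagedMemory(const float *h_in, float *h_out, size_t bytes) {
    float *d_buf = nullptr;

    // Before: cudaMallocManaged(&d_buf, bytes); and host code/kernels touched it directly.
    // After: device-only memory with explicit staging.
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);

    // ... launch the kernels that read/write d_buf here ...

    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
[/code]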

OK, thanks txbob. The kernel is running 900x slower.

Apparently I just need to stop requesting unavailable peer mappings. But I will also stop using managed memory for inputs since that’s not even necessary.

It seems odd that a failed request would cause a GPU-wide fallback, and that zero-copy memory is so drastically slower.

I removed all the failed peer mapping requests, and all the cudaMallocManaged calls, but kernel execution time is still as slow as before.

And when requesting peer mapping for GPU 0, it succeeds for GPUs 1-4, so does that mean there are 5 GPUs on that CPU node?

I was able to map 1-3 (or 4) to GPU 0 and 5-7 to GPU 4.

I won’t be able to explain the performance of the application. I have no visibility into it.

Regarding the peer mapping, the hybrid cube mesh places 4 GPUs as one-hop NVLINK neighbors to a particular GPU. Not all of these are connected via PCIE to a particular CPU socket. It’s probably best if you study a hybrid cube mesh diagram:

[url]https://devblogs.nvidia.com/inside-pascal/[/url]

click this link below:
[url]https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/8-GPU-hybrid-cube-mesh-624x424.png[/url]

Every green arrowhead on a particular GPU represents a connection to a one-hop neighbor. Notice that each GPU has 4 green arrowheads pointing to it.
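If you want to see the actual pairings on your instance rather than the diagram, a small program that prints the full peer-access matrix (a minimal sketch, in the spirit of the p2pBandwidthLatencyTest CUDA sample) will show the 4 one-hop neighbors of each GPU:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // Print an n x n matrix: 1 means device i can access device j as a peer.
    printf("      ");
    for (int j = 0; j < n; ++j) printf("GPU%d ", j);
    printf("\n");
    for (int i = 0; i < n; ++i) {
        printf("GPU%d  ", i);
        for (int j = 0; j < n; ++j) {
            int can = (i == j);
            if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
            printf("  %d  ", can);
        }
        printf("\n");
    }
    return 0;
}
[/code]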

I have run my code configured to use only 4 GPUs on the p3.16xlarge instance (the same configuration that runs very fast on the p3.8xlarge instance). The result is the same glacial performance as before.

I have concluded that:

The problem is detailed in the profiler warning:

==7112== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation

The AWS p3.16xlarge instance contains “devices without peer-to-peer support,” which causes UVM to fall back to VERY slow “zero-copy memory,” which in turn causes my code to run VERY slowly, apparently because I am enabling peer-to-peer data links with managed memory allocations. The numerical results are the same as on the p3.8xlarge instance.

This does not occur on the AWS p3.8xlarge instance because it does not contain “devices without peer-to-peer support.”

If you set the CUDA_VISIBLE_DEVICES environment variable to include only the 5 GPUs in question (and none of the non-peer-mappable GPUs), you may be able to get that case to run “not glacially slow”.
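For example, the variable could be set in the environment before launching the application, or set programmatically before the first CUDA runtime call (the runtime reads it when it initializes). A minimal Windows-flavored sketch, assuming GPU 0’s peers are GPUs 1-4 as you reported:

[code]
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    // Must happen before any CUDA runtime call in this process.
    // Restricts the process to GPU 0 plus its four P2P peers; the visible
    // devices are then renumbered 0-4.
    _putenv("CUDA_VISIBLE_DEVICES=0,1,2,3,4");   // Windows CRT; use setenv() on Linux

    int n = 0;
    cudaGetDeviceCount(&n);   // should now report 5 devices
    // ... rest of the application ...
    return 0;
}
[/code]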

According to this: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Technical Blog

you are correct that CUDA_VISIBLE_DEVICES will enable me to run at full speed on 4 of the 8 GPUs. However, I have already verified that my code runs fast on 4 GPUs. Thanks for that suggestion.

What I need is for NVidia/AWS to provide a solution that allows me to utilize UVM and Peer-to-Peer at full speed on an 8 GPU system.

Any suggestion on how to get this fixed?

I was suggesting you might be able to get to 5 of 8 GPUs with that environment variable. The “master” GPU plus its 4 P2P peers.

There isn’t anything broken here. By that I mean the system design is behaving as expected, and the system design will inherently not allow all 8 GPUs to enter into a P2P clique. That limitation has implications for UM behavior.

To “fix” this using current hardware/software methodologies would require a system (HW) design that allows all 8 GPUs to enter into the same P2P clique.

As discussed previously, it’s not possible, with currently available NVLINK HW, to allow 8 GPUs to enter into an 8-way P2P clique supported by NVLINK. This is because a Volta V100 has only 6 NVLINK connections, so, using today’s available technology, it can only have up to 6 (“one-hop”) peers. Thus, with a modified HW design, you could conceive of a platform where a maximum of 7 V100 devices could be in the same P2P clique. But I know of no system HW design today that adheres to this (a “fully connected” NVLINK mesh).

With respect to an 8-way P2P clique supported by PCIE, instead of NVLINK, such systems exist and are available today. This particular HW design that AWS has used for their P3 instances is not one of them.

Having said all that, I suspect it might be possible to refactor your application to get approximately full performance, in this particular AWS p3 8 GPU setup. It may or may not involve managed memory. It probably would not involve P2P in an 8-way clique.
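As one hypothetical direction (a sketch only, based on the peer groupings you reported earlier: GPU 0 maps to GPUs 1-4, GPU 4 maps to GPUs 5-7), you could reduce within each peer group first so that every explicit transfer stays inside a reported peer pairing: GPU 0 accumulates GPUs 1-3, GPU 4 accumulates GPUs 5-7, then GPU 0 pulls in GPU 4’s partial sum. sumInto is the accumulation kernel sketched earlier in the thread; note that cudaMemcpyPeer works even between non-peered devices by staging through host memory.

[code]
#include <cuda_runtime.h>

__global__ void sumInto(float *dst, const float *src, size_t n);  // defined earlier

// d_buf[g] = per-GPU result buffer; d_staging[g] = scratch buffer on GPU g.
void twoStageReduce(float **d_buf, float **d_staging, size_t n, size_t bytes)
{
    unsigned int blocks = (unsigned int)((n + 255) / 256);

    // Stage 1a: GPU 0 accumulates GPUs 1-3 (direct P2P, assuming peer access enabled).
    cudaSetDevice(0);
    for (int g = 1; g <= 3; ++g) {
        cudaMemcpyPeer(d_staging[0], 0, d_buf[g], g, bytes);
        sumInto<<<blocks, 256>>>(d_buf[0], d_staging[0], n);
        cudaDeviceSynchronize();
    }

    // Stage 1b: GPU 4 accumulates GPUs 5-7 (direct P2P within its own group).
    cudaSetDevice(4);
    for (int g = 5; g <= 7; ++g) {
        cudaMemcpyPeer(d_staging[4], 4, d_buf[g], g, bytes);
        sumInto<<<blocks, 256>>>(d_buf[4], d_staging[4], n);
        cudaDeviceSynchronize();
    }

    // Stage 2: fold GPU 4's partial sum into GPU 0's result. Per the mapping
    // reported above, GPU 0 and GPU 4 are also peers; even if they were not,
    // cudaMemcpyPeer would still work by staging through host memory.
    cudaSetDevice(0);
    cudaMemcpyPeer(d_staging[0], 0, d_buf[4], 4, bytes);
    sumInto<<<blocks, 256>>>(d_buf[0], d_staging[0], n);
    cudaDeviceSynchronize();
}
[/code]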

BTW, the nccl library is designed to allow efficient collective communications amongst GPUs in a single system, with or without UM, with or without the ability for all GPUs to simultaneously enter into the same P2P clique.
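For what it’s worth, the sum step maps naturally onto an allreduce. A minimal single-process NCCL sketch (illustrative only, not tested here; the communicators would be created beforehand with ncclCommInitAll):

[code]
#include <cuda_runtime.h>
#include <nccl.h>

// Sum 'count' floats across nDev GPUs so every GPU ends up with the total.
// sendbuf[g] and recvbuf[g] are device pointers resident on GPU g;
// comms/streams are one per device (comms from ncclCommInitAll).
void allReduceSum(float **sendbuf, float **recvbuf, size_t count,
                  int nDev, ncclComm_t *comms, cudaStream_t *streams)
{
    ncclGroupStart();
    for (int g = 0; g < nDev; ++g)
        ncclAllReduce(sendbuf[g], recvbuf[g], count, ncclFloat, ncclSum,
                      comms[g], streams[g]);
    ncclGroupEnd();

    // Wait for the collective to complete on every device.
    for (int g = 0; g < nDev; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
}
[/code]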

Thanks for your detailed suggestions txbob.

I certainly don’t know what’s involved in eliminating the zero-copy fallback that occurs just because non-NVLinked GPUs exist in the system. It just seems like that should be possible without HW changes when those links are never even requested to be enabled.

I will investigate nccl and other solutions.

Unfortunately, NCCL is currently Linux-only, and I would have to invest significant time porting my Windows code just to determine whether there is any speed advantage over using 4-5 GPUs.