CUDA peer resources error when running on more than 8 K80s (AWS p2.16xlarge)

We are currently running Torch and TensorFlow on the p2.16xlarge instances on AWS.

When running examples on more than 8 K80s, we are getting errors from CUDA like:

cuda runtime error (60) : peer mapping resources exhausted:

$ echo "require 'cudnn' | th
THCudaCheck FAIL file=/tmp/torch/extra/cutorch/lib/THC/THCGeneral.c line=176 error=60 : peer mapping resources exhausted
/program/torch/torch7/share/lua/5.1/trepl/init.lua:384: /program/torch/torch7/share/lua/5.1/trepl/init.lua:384: cuda runtime error (60) : peer mapping resources exhausted at /tmp/torch/extra/cutorch/lib/THC/THCGeneral.c:176

CUDA_ERROR_TOO_MANY_PEERS:

$ python tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py --num_gpus=16
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
...

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:0f.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x3d3f100
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:10.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x414fa00
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 2 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:11.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x455ffb0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 3 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:12.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4974090
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 4 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:13.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x4d8bc50
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 5 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:14.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x51a7320
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 6 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:15.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x55c68d0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 7 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:16.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x59e9560
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 8 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:17.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x5e0fcd0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 9 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:18.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6239f20
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 10 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:19.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6667c40
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 11 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1a.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6a99c40
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 12 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1b.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x6ecef20
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 13 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1c.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x7307ce0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 14 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1d.0
Total memory: 11.25GiB
Free memory: 11.13GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x7744590
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 15 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.13GiB
F tensorflow/core/common_runtime/gpu/gpu_init.cc:116] could not enable peer access for GPU devices: Internal: Internal: failed to enable peer access from 0x3f6f560 to 0x6662dc0: CUDA_ERROR_TOO_MANY_PEERS

Running simpleP2P shows all 16 GPUs have P2P access.

We are running on Amazon Linux. The errors are identical with CUDA 7.5 + NVIDIA driver 352.99 and with CUDA 8.0 + NVIDIA driver 367.48.

What do these errors mean, and does anyone have suggestions on how to debug this further?

We have exactly the same problem; we are trying to run the MNIST example.

CUDA supports a maximum of 8 GPUs in a peer-to-peer (P2P) ensemble.

Since a K80 consists of 2 GPU devices, at most 4 K80s can participate fully (i.e., with all of their GPU devices) in a single P2P ensemble.

All 16 GPUs could still be used as 2 separate ensembles, I believe. If I were doing this, I would run two separate applications/processes, and for each I would specify an appropriate 8-GPU-device ensemble using a mechanism such as the CUDA_VISIBLE_DEVICES environment variable (see the sketch below).
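
As a minimal sketch (my own illustration, not an official sample): when CUDA_VISIBLE_DEVICES restricts a process to 8 devices, an all-pairs peer enable like the one below stays within the limit; with all 16 devices visible, the same loop is what runs into CUDA_ERROR_TOO_MANY_PEERS.

// Sketch: enable P2P among all devices visible to this process.
// Launched with e.g. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7, each device
// enables at most 7 peers, which is within the 8-peer limit; with all
// 16 devices visible the enables eventually fail with cudaErrorTooManyPeers.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        for (int peer = 0; peer < n; ++peer) {
            if (peer == dev) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, dev, peer);
            if (!canAccess) continue;
            cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
            printf("enable %d -> %d: %s\n", dev, peer, cudaGetErrorString(err));
        }
    }
    return 0;
}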

Is there some basic limitation in PCIe itself, or in the GPU hardware, that prevents more than eight peers per P2P ensemble, or is this a software limitation that may be lifted in the future? I suspect it is the former, but I don't really understand the hardware at that level of detail.

It's at least in part a GPU hardware limitation. I don't have a lot of technical details I can share publicly, but it has to do with the number of hardware "mailbox" resources available. A vague reference to a hardware mailbox as a P2P resource appears in this paper, for example:

[url]https://arxiv.org/pdf/1307.8276.pdf[/url]

It's possible that this limitation will be lifted in future GPUs.

Thanks, that is good to know.

Does anyone know if the limit is higher for Pascal GPUs?

It is not higher for any currently available GPUs, including Pascal.

It is worth re-trying these tests with the official NVIDIA NVIDIA-Linux-x86_64-352.99.run driver.

Our team successfully ran peer-to-peer between 16 different GPUs on the p2.16xlarge with this driver (older drivers were limited to 8).

While every GPU can reach every other GPU, there is currently a limit: a given GPU cannot communicate with more than 8 peers at a time (i.e., only 8 concurrent cudaDeviceEnablePeerAccess() calls per device; you need to call cudaDeviceDisablePeerAccess() for one of those 8 before enabling a 9th). This is illustrated in the sketch below.
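
For illustration (a sketch of my own, not from the SDK), from a single device the 9th cudaDeviceEnablePeerAccess() call fails with cudaErrorTooManyPeers until one of the existing mappings is released with cudaDeviceDisablePeerAccess():

// Sketch of the per-device limit (assumes at least 10 peer-capable GPUs
// are visible, as on a p2.16xlarge). Device 0 can hold at most 8 peer
// mappings at once; a 9th enable fails until one is disabled.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaSetDevice(0);

    // Enable peer access from device 0 to devices 1..8 (8 peers: the limit).
    for (int peer = 1; peer <= 8; ++peer) {
        cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
        printf("enable 0 -> %d: %s\n", peer, cudaGetErrorString(err));
    }

    // A 9th peer exceeds the limit and should return cudaErrorTooManyPeers.
    cudaError_t err = cudaDeviceEnablePeerAccess(9, 0);
    printf("enable 0 -> 9: %s\n", cudaGetErrorString(err));

    // Releasing one mapping frees a slot, after which the 9th peer can be enabled.
    cudaDeviceDisablePeerAccess(1);
    err = cudaDeviceEnablePeerAccess(9, 0);
    printf("enable 0 -> 9 after disabling 0 -> 1: %s\n", cudaGetErrorString(err));
    return 0;
}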

Yes, the P2P benchmarks provided in the CUDA SDK work fine, so basic connectivity between all pairs is not the problem.

The problem is that the two higher-level NN libraries we are testing with (Torch and TensorFlow) exercise the system in a way that triggers this error. I am not aware of a way to selectively disable peer access between GPUs in either library.

I am happy to run further tests or provide additional details, either in this thread or via direct message. We are still looking for ways to work around this limitation.

Have you tried to get this under control by using CUDA_VISIBLE_DEVICES as suggested by txbob?

Since there are indications that the limit of eight GPUs per peer-to-peer group is hardware related and therefore unlikely to change in the very near future, I think your best medium-term way forward is to enter into a dialogue with the vendors of these libraries to see what can be done on the library side, either through configuration tricks with the existing software or through extensions to its functionality.

Yes, our next step is to use the distributed training facilities in these libraries and then whitelist groups of GPUs with CUDA_VISIBLE_DEVICES for separate processes.

This will no doubt work; it's just a bummer that this added complexity is required in the single-node case.

IMHO node configurators should keep such important non-trivial limitations in mind when configuring nodes. Cramming GPUs into a case with insufficient concern for scalability (just because you can) does not seem very sensible to me. GPU-accelerated node configurations offered by this particular vendor have puzzled me in the past, but then I do not know who their main customers for these nodes are.

Agreed.

Configuration headaches aside, however, I would expect that even without P2P access, the performance of one 16-GPU node will be better than that of two 8-GPU nodes due to better interconnectivity, but we are still working on verifying that with benchmarks.

It really depends on the exact workload characteristics and the underlying OS. Some OSes or workloads don't do a good job of partitioning memory across dual-socket servers, and one can end up with bottlenecks on the host memory-access side on the p2.16xlarge.

Another aspect is inter-GPU bandwidth versus bandwidth to the processor. If the application is limited by bandwidth between GPU and processor, two p2.8xlarge instances may perform better. If the application is limited by GPU-to-GPU communication, the p2.16xlarge could perform better.
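
A rough sketch of how one might compare the two paths (my own illustration; device IDs and transfer size are arbitrary):

// Sketch: compare GPU0 -> GPU1 peer-copy bandwidth with host -> GPU0
// copy bandwidth for a single 256 MiB transfer (assumes at least 2
// peer-capable GPUs).
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 256 << 20;
    void *d0, *d1, *h;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaMallocHost(&h, bytes);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // GPU0 -> GPU1 peer copy.
    cudaEventRecord(start);
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.1f GB/s\n", bytes / ms / 1e6);

    // Host -> GPU0 copy over PCIe.
    cudaEventRecord(start);
    cudaMemcpy(d0, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host -> GPU0: %.1f GB/s\n", bytes / ms / 1e6);
    return 0;
}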

The limit of 8 peer connections is documented, BTW:

[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access[/url]

Ah, so it is, good find. Thanks!