Using an NVLink bridge makes a big impact on training speed with 2x RTX 2080 Ti (multi-GPU training with P2P)

I thought it would be nice to share my experience with installing an NVLink bridge.

We have a server with 8x RTX 2080 Ti cards. I was experimenting with how well training scales with an increasing number of GPUs. The results I found on the internet are reported with the batch size increased along with the number of GPUs: with 1 GPU the batch size is 64, with 2 it is 128, and so on. However, in DNN training big batch sizes do not always bring the best results, and I was interested in keeping the batch size at 64 while distributing the computation across several GPUs. The results depend greatly on the DNN model used: the bigger the model, the less advantage there is in running across multiple GPUs. The DNN framework also plays a very big part in the speed.

In my experiments I found that the NVIDIA branch of Caffe (NVCaffe) and PyTorch give the best results, beating TensorFlow and MXNet. This is an expected result, confirmed by many publications on the internet.

What I did not expect is that using the NVLink bridge makes a VERY significant impact. Here are the results using NVCaffe:
single GPU: 450 images/second
dual GPU via a single PCIe switch: 535 images/second
dual GPU via NVLink (P2P enabled): 830 images/second

The model I train has a massive last layer, on the order of 200K-300K outputs, so I believe a lot of data needs to be copied between the GPUs, which is why a fast link makes such an impact.
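To get a feel for the link speed itself, you can time a direct GPU0-to-GPU1 copy of a tensor roughly the size of such a layer. This is just a sketch of mine in PyTorch, not the benchmark above; the tensor size and device indices are illustrative only.

```python
# Sketch: rough GPU0 -> GPU1 copy bandwidth measurement in PyTorch.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

# Roughly the size of a large last-layer weight/gradient tensor (~1.2 GB of float32).
src = torch.randn(300_000, 1024, device="cuda:0")
dst = torch.empty_like(src, device="cuda:1")

dst.copy_(src)                       # warm-up copy, excludes one-time setup costs
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):
    dst.copy_(src)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
t1 = time.perf_counter()

gib_moved = src.numel() * src.element_size() * 10 / 2**30
print(f"~{gib_moved / (t1 - t0):.1f} GiB/s from cuda:0 to cuda:1")
```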


Alexey,

Do you have a simple example of PyTorch code that actually uses NVLink? For example, looking at this simple demo code, what needs to change to use NVLink (assuming multiple cards are available and linked via NVLink)?

https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html

Hi Andrei,

The experiments I did with P2P were for the Caffe framework. Basically, Caffe was built using libraries that can benefit from fast data exchange between GPU cards via NVLink (P2P). There is nothing you have to do in the protobuf configuration files specifically to enable this in Caffe; it is all done behind the scenes.

With PyTorch I would assume the same is true. Namely, if PyTorch knows how to exchange data directly between GPU cards (without copying to CPU memory), then everything should work without explicit Python statements. I think in all modern builds of PyTorch, P2P data exchange is enabled by default.
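A quick way to check this on your own machine (a sketch of mine, not something PyTorch requires you to run) is to ask CUDA whether each GPU pair can access the other's memory directly:

```python
# Sketch: report whether direct peer-to-peer (P2P) access is possible
# between every pair of GPUs. If it prints True for a pair, copies
# between those GPUs can bypass host memory; NVLink is one of the
# interconnects that enables this.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'available' if ok else 'NOT available'}")
```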

Here are some useful links for you
https://pytorch.org/docs/stable/distributed.html
https://pytorch.org/tutorials/intermediate/dist_tuto.html
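For completeness, here is a bare-bones DDP skeleton in the spirit of those tutorials (my own sketch; the model, data, and hyperparameters are placeholders). Note there is nothing NVLink-specific in it: with the NCCL backend, the gradient all-reduce uses P2P/NVLink automatically when it is available.

```python
# Sketch: minimal single-node DistributedDataParallel training loop.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model with a large output layer, in the spirit of the
    # "massive last layer" case discussed above.
    model = torch.nn.Linear(1024, 300_000).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                       # placeholder training loop
        x = torch.randn(32, 1024, device=rank)
        loss = ddp_model(x).sum()             # placeholder loss
        optimizer.zero_grad()
        loss.backward()                       # gradients all-reduced via NCCL
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```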

Hope this helps,
Alexey

Hi Alexey,

Do you mind sharing your operating system info? I work on an Ubuntu system with two 3090 GPUs and NVLink. It doesn't seem to improve performance. I use the PyTorch DDP module.

You mentioned “enabling P2P”. Is there a switch that I need to turn on in Ubuntu for NVLink to work properly?

Thanks!

Hi xguo,

We used Ubuntu 20.04, but I don't have access to the system any more, sorry. NVLink helps when you need to copy a lot of data between GPU cards. Normally this happens when you split the batch between GPUs and have a lot of weights in the last layer of the neural net. If the amount of data is not big, the effect of NVLink is barely noticeable.
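I am not aware of a switch you need to flip in Ubuntu itself. One thing worth checking (a suggestion, not something we did on that system) is what the driver reports with `nvidia-smi topo -m`; NV# entries in the matrix mean the cards are connected over NVLink. A tiny wrapper if you prefer to run it from Python:

```python
# Sketch: print the GPU interconnect topology reported by the driver.
import subprocess

print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)
```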

Regards
Alexey

Thanks for your reply! I finally made it work after reinstalling the system and the drivers. I will post a step-by-step guide later.

Thanks again