Performance drop when running two processes on different GPUs on one machine

Hi,

I have two different workstations, set up as follows:
WS1
MSI X99 board (32 GB RAM)
Intel i7-5820K
2x GTX 970
NVIDIA driver 346.28
Ubuntu 14.10 (GNU/Linux 3.16.0-31-generic x86_64)
CUDA 7.0

WS2
MSI X99 board (32 GB RAM)
Intel i7-5930K
2x GTX 970
1x GTX 980
NVIDIA driver 346.47
Ubuntu 14.10 (GNU/Linux 3.16.0-31-generic x86_64)
CUDA 7.0

We don’t use an SLI bridge (the small connector that links the cards).
I use CUDA through the Caffe library. I run two Caffe processes on one machine, each on a different card.
I checked with nvidia-smi that the two processes run on different GPUs.
With one process running, one iteration of my sample program takes nearly 5 minutes.
With both processes running, the time per iteration increases to 10 to 15 minutes, and sometimes up to 50 minutes.
I profiled with nvprof, and the average, min and max times per function are basically the same in both cases.
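
For reference, a minimal sketch of one way to pin each process to its own card, assuming the CUDA_VISIBLE_DEVICES environment variable is used for the pinning (the post does not say how the cards were selected). The helper names and the stand-in worker command are hypothetical; the stand-in only prints the variable it received and would be replaced by the real Caffe command.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index):
    """Return a copy of the current environment restricted to one GPU."""
    env = os.environ.copy()
    # CUDA_VISIBLE_DEVICES limits which devices the CUDA runtime exposes;
    # inside the child process the selected card appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_per_gpu(command, gpu_indices):
    """Start one copy of `command` per GPU and return {gpu_index: Popen}."""
    return {
        gpu: subprocess.Popen(command, env=env_for_gpu(gpu),
                              stdout=subprocess.PIPE, text=True)
        for gpu in gpu_indices
    }


# Stand-in worker (hypothetical): it only prints the variable it was given.
# Replace it with the real training command.
DEMO_COMMAND = [sys.executable, "-c",
                "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]


def run_demo(gpu_indices=(0, 1)):
    """Launch the stand-in worker once per GPU and collect what each one saw."""
    procs = launch_per_gpu(DEMO_COMMAND, gpu_indices)
    return {gpu: proc.communicate()[0].strip() for gpu, proc in procs.items()}
```

The index given to CUDA_VISIBLE_DEVICES follows the CUDA runtime's device ordering, which may differ from the numbering nvidia-smi shows, so it is worth confirming the mapping with nvidia-smi once the processes are running.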

A sample from the profile. I stopped the run after ~25 minutes because the sample program runs for about 200 minutes and I didn’t want to wait.

                          Time(%) Time         Calls  Avg       Min       Max       Name
one process             : 17.41%  289.804s     13320  21.757ms  6.8031ms  41.368ms  void cudnn::detail::convolve_dgrad_engine
two different processes : 17.42%  95.8076s      4413  21.710ms  6.7617ms  40.956ms  void cudnn::detail::convolve_dgrad_engine
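
A rough back-of-the-envelope check on the two profile lines above, as a sketch only. It assumes the percentage column is this kernel's share of the total profiled GPU time (as in nvprof's summary output), and that both runs were profiled over a similar wall-clock window of ~25 minutes, which is not stated explicitly. The function names are made up for this illustration.

```python
def implied_total_gpu_seconds(kernel_seconds, percent_of_total):
    """Total profiled GPU time implied by one kernel's time and its share."""
    return kernel_seconds / (percent_of_total / 100.0)


def average_call_ms(kernel_seconds, calls):
    """Average duration of one kernel call in milliseconds."""
    return kernel_seconds / calls * 1000.0


# Numbers copied from the two profile lines for convolve_dgrad_engine.
one_proc = {"seconds": 289.804, "percent": 17.41, "calls": 13320}
two_proc = {"seconds": 95.8076, "percent": 17.42, "calls": 4413}

total_one = implied_total_gpu_seconds(one_proc["seconds"], one_proc["percent"])
total_two = implied_total_gpu_seconds(two_proc["seconds"], two_proc["percent"])

# Fraction of kernel calls completed in the two-process run
# relative to the single-process run.
call_ratio = two_proc["calls"] / one_proc["calls"]

print("implied GPU time, one process  : %.0f s" % total_one)
print("implied GPU time, two processes: %.0f s" % total_two)
print("calls completed (two / one)    : %.2f" % call_ratio)
print("avg call, one process  : %.3f ms"
      % average_call_ms(one_proc["seconds"], one_proc["calls"]))
print("avg call, two processes: %.3f ms"
      % average_call_ms(two_proc["seconds"], two_proc["calls"]))
```

Under those assumptions the implied GPU time is roughly 1665 s for the single process versus roughly 550 s with two processes, with about a third as many kernel calls and an unchanged time per call. That would suggest the GPU sits idle for much of the two-process run, and that the extra time is spent outside the kernels rather than in slower kernels.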

It happens on both machines, with different driver versions. (I updated WS1 to the latest CUDA version and drivers, and it makes no difference.)

Any solutions for this problem?

Thank you!

This seems similar to:

[url]https://devtalk.nvidia.com/default/topic/818054/cuda-programming-and-performance/running-two-instances-of-matlab-calling-mex-dll-files-which-use-different-gpus-on-the-same-pc/[/url]

I have not really tried to run multiple processes on the same machine.
However, in addition to the suggestions in the link, it seems that the best way to run multiple instances is either as separate processes coordinated via MPI/IPC, or as threads of a single primary process.
This should give you better control over the CUDA context.