I am currently testing software that works with several CUDA GPUs.
The code supports both Linux and Windows.
What has been baffling me is the contradictory results between Linux and Windows.
I ran the CUDA profiler on both systems (profiles uploaded in the attachments).
We can see that on Windows it runs just as expected (the second GPU is slower because it is a slower card):
the kernel launches are packed tightly together, which increases overall efficiency as expected.
However, on Linux the GPUs seem to have a hard time running kernels in parallel. Either:
the launches execute serially, as with the first three blocks; or
they are launched in parallel but spread out so far that the overall computation time is about the same as a serial run, as with the six blocks.
Has anyone else experienced a similar problem? If so, is there anything I can do about it?
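For context, this is the launch pattern I would expect to produce the overlapped Windows timeline: issue the kernel asynchronously on every device first, and only then synchronize. This is a minimal sketch, not my actual code; the kernel body, buffer size, and launch configuration are placeholders (CUDA 4.0+ runtime API, one host thread driving all devices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation.
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    const int n = 1 << 20;            // illustrative problem size
    float *bufs[16] = {0};            // assumes at most 16 devices

    // Phase 1: launch on every GPU without waiting. Kernel launches are
    // asynchronous, so the loop returns immediately and the devices can
    // execute concurrently.
    for (int dev = 0; dev < ndev && dev < 16; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&bufs[dev], n * sizeof(float));
        work<<<(n + 255) / 256, 256>>>(bufs[dev], n);
    }

    // Phase 2: synchronize afterwards. Note that calling
    // cudaDeviceSynchronize inside the launch loop instead would
    // serialize the GPUs, producing a timeline like the Linux one above.
    for (int dev = 0; dev < ndev && dev < 16; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(bufs[dev]);
    }
    printf("done on %d device(s)\n", ndev);
    return 0;
}
```

If the application already follows this pattern and the timelines still disagree between OSes, that points at either driver scheduling or, as the reply below suggests, at how the profiler reconstructs the timeline rather than at the code itself.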
I have multi-GPU code working on Linux. We used the command-line profiler to reconstruct the timeline and everything was fine, including memory-copy and compute-kernel overlap. After that I used the Visual Profiler and noticed exactly your problem: the launches are serialized and the profiler says there is no overlap between memory transfers and kernels. At this point I suspect it is a bug in the Visual Profiler.
I gave 4.1 RC2 a try and that solved my kernel and memory-transfer overlap issue: the profiler now shows overlapped transfers and kernels as expected.
By the way, I have another issue: "Event/metric collect failed", with kernels behaving differently between runs …
I will open another post for that.