I am running a tiled Cholesky application on a Linux system with 4 x K40 GPUs and I am experiencing really low memory throughput for host <-> device transfers, even though I am using pinned memory with cudaMemcpyAsync and streams.
I use 3 streams for data transfers (one for D2H, one for H2D transfers and one for D2D transfers) and several other streams for kernel launches.
For every kernel, the data chunks it needs are first transferred asynchronously to the GPU and then, when the transfers have finished, the kernel is launched. All operations are asynchronous and I check for their completion with events, so I can overlap data transfers with computation.
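A minimal sketch of the pattern described above (stream, event, buffer, and kernel names are illustrative placeholders, not the poster's actual code; error checking omitted):

```cuda
#include <cuda_runtime.h>

// One stream for H2D copies, one for compute, as in the post.
cudaStream_t h2dStream, computeStream;
cudaEvent_t copyDone;
cudaStreamCreate(&h2dStream);
cudaStreamCreate(&computeStream);
cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

float *hChunk, *dChunk;
size_t bytes = 33554432;  // ~33.5 MB tile, matching the sizes reported
cudaHostAlloc(&hChunk, bytes, cudaHostAllocDefault);  // pinned host buffer
cudaMalloc(&dChunk, bytes);

// Asynchronous H2D copy in the transfer stream...
cudaMemcpyAsync(dChunk, hChunk, bytes, cudaMemcpyHostToDevice, h2dStream);
cudaEventRecord(copyDone, h2dStream);

// ...and the compute stream waits on the event before launching,
// so transfers for other tiles can overlap with this kernel.
cudaStreamWaitEvent(computeStream, copyDone, 0);
myKernel<<<grid, block, 0, computeStream>>>(dChunk);  // hypothetical kernel
```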
Launching the application with nvprof and later importing the output file to nvvp, I see the following:
When running with only 1 GPU, everything works as expected: kernels overlap with data transfers, and the average throughput for host <-> device data transfers is around 10 GB/s for pinned chunks of 33.5 MB.
When splitting the computation across 2 GPUs (I keep the same number of kernels, but additional device-to-device transfers must be issued; these go in a separate stream, asynchronously), data transfers still overlap with kernels, but the throughput for host <-> device transfers drops to only 300 MB/s on average for the same pinned 33.5 MB chunks. Some transfers (whether H2D, D2H or D2D) still reach around 8 GB/s, which is what I expect, but for some reason most transfers stay below 400 MB/s. nvvp reports the memory as pinned and shows the transfers in a stream other than 0, so I have no idea why this happens.
Any ideas about this throughput slowdown? Does using 2 GPUs have some bad influence on memory bandwidth?
Please, let me know if you need further information, it’s my first post… :-)
nvvp tells me the throughput of each transfer as I pass the mouse over them. Also, I can click on the timeline to see the average transfer throughput, the number of transfers, …
I’ll try to attach a couple of screenshots for both 1 GPU and 2 GPU cases.
As you have 4 devices in your system, I assume you are using quite a few streams in total? Have you set CUDA_DEVICE_MAX_CONNECTIONS to the total number of streams you are using?
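For reference, the variable is set in the environment before launching; a hypothetical launch line (`./my_app` and the value 32 are placeholders, pick a value matching your stream count):

```shell
# Raise the number of hardware work queues per device (default is 8)
# so that streams are less likely to be serialized onto shared queues.
export CUDA_DEVICE_MAX_CONNECTIONS=32
./my_app
```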
Are you using numactl to control CPU and memory affinity such that each GPU always communicates with the "near" CPU and memory? This is important in dual-socket systems, as NUMA effects can be quite pronounced.
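For example, binding both the CPUs and the memory allocations to one NUMA node looks like this (`./my_app` and node 0 are illustrative; use the node your GPUs hang off):

```shell
# Run on the cores AND allocate from the memory of NUMA node 0,
# so pinned host buffers end up near the GPUs attached to that socket.
numactl --cpunodebind=0 --membind=0 ./my_app
```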
Agree with njuffa. I suspect a topology issue. You can use numactl to pin your processes to sockets that are topologically “near” to the GPUs you want to use. You can also use taskset, which I find to be simpler semantically to quickly get a read on this.
If your GPUs 0 and 1 are not actually connected to the same socket, then even this won’t sort it out for you - you’ll need to be pretty confident of your understanding of your system topology. If you are absolutely certain they are connected to the same socket, then run your app with:
taskset -c 0 ./my_app
and modify the 0 above to different values up to the CPU core count of your system. You will find the mapping of logical cores to the physical socket that is “closest” to your GPUs.
But if your GPUs are not attached to the same socket, then the above will always yield a situation where one of the GPUs is favored.
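To get a read on the topology before guessing, a couple of standard commands help (output layout varies by system):

```shell
# Connection matrix between GPUs, plus each GPU's CPU affinity column.
nvidia-smi topo -m

# Which logical cores belong to which NUMA node / socket.
lscpu | grep -i numa
```

The CPU affinity column of `nvidia-smi topo -m` tells you directly which core ranges are "near" each GPU, which is what the taskset experiment above is probing for.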
Then, I think GPUs #0 and #1 are attached to the same socket, which is socket #0. Is this correct?
I’m using a custom library to create the threads and pin them to the appropriate cores. But I just checked, and both threads are actually bound to cores belonging to socket #1, so this is completely wrong.
Since I’m using several threads and I don’t want to oversubscribe cores, I tried taskset to limit the execution of my application to socket #0 (taskset -c 0-11, as there are 12 cores per socket), instead of taskset -c 0 (with respect to socket affinity, it should have the same effect, right?), but I don’t see any memory bandwidth improvement.
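One way to double-check what the process actually got bound to, from inside the taskset-ed shell (these are standard commands, shown here as a suggestion):

```shell
# Effective CPU affinity mask of the current shell (inherited by children).
taskset -p $$

# Effective NUMA policy: allowed cpubind and membind nodes.
numactl --show
```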
Is it possible that the problem lies in the hardware (or some hardware-related issue) rather than in my application? I tried the same application on a similar machine (4 x K40s on the same socket) and the average memory throughput is around 8 GB/s with either 1 or 2 GPUs.
Not sitting in front of the system, it is difficult to offer much more than speculation.
What kind of system platform is this? What CPUs are being used, and who is the motherboard vendor? I am wondering whether there could be a basic hardware limitation, e.g. an insufficient number of PCIe lanes to feed two x16 interfaces per CPU, causing the links to be automatically downgraded when two GPUs are plugged in. There have also been issues in the past with various system BIOSes (SBIOS) when multiple Teslas are used. Is the system running the latest SBIOS available from the vendor? Have you checked the SBIOS configuration for any signs of possible misconfiguration?
If all efforts of trying to resolve this at the software level fail, you may want to contact your system vendor / system integrator to see if they can give advice on how to optimally configure it for four Teslas in the system.
Ok, I will contact the system administrator of the machine, as I’m just a user and I have no way to check everything you said :-( Your speculations are really welcome, thanks!