Less Asynchronous Data Transfer/Kernel Overlap on a K40 than on a GTX 770

I’ve been developing some CUDA code on an NVIDIA GTX 770. To obtain overlap between data transfers and kernel execution, I implemented a finite state machine to correctly schedule these operations on four streams. Using this strategy it was possible to completely overlap the two. See http://imgur.com/DmGm5tG.

I then tried running the same code on a K40, and there seems to be far less overlap on this device. See http://imgur.com/hJKn0js. It almost appears as if the driver “waits” a bit before scheduling the data transfers. I was surprised, given the more advanced scheduling features and the extra copy engine of the GK110 architecture.

I haven’t posted any code because it’s not simple to provide a minimal example. The essential pattern, per stream, is (see the sketch after the list):

  1. transfer data on
  2. execute kernels
  3. transfer a single floating point value off
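
A minimal sketch of that pattern, assuming pinned host buffers and four streams; the kernel name, buffer sizes, and the plain round-robin loop (in place of my finite state machine) are all illustrative, not taken from the real code:

[code]
#include <cuda_runtime.h>

// Placeholder for the real kernels: each thread reads its element,
// thread 0 writes the single floating point result.
__global__ void work(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] * 2.0f : 0.0f;
    if (i == 0) out[0] = v;
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 20;                        // illustrative transfer size

    cudaStream_t streams[nStreams];
    float *h_in[nStreams], *h_out[nStreams], *d_in[nStreams], *d_out[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void **)&h_in[s],  n * sizeof(float));  // pinned host buffers
        cudaMallocHost((void **)&h_out[s], sizeof(float));
        cudaMalloc((void **)&d_in[s],  n * sizeof(float));
        cudaMalloc((void **)&d_out[s], sizeof(float));
    }

    for (int s = 0; s < nStreams; ++s) {
        // 1. transfer data on
        cudaMemcpyAsync(d_in[s], h_in[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        // 2. execute kernels
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
        // 3. transfer a single floating point value off
        cudaMemcpyAsync(h_out[s], d_out[s], sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    return 0;
}
[/code]

On the GTX 770 this enqueue order overlaps completely (first profile above); on the K40 the same sequence shows the delayed transfers (second profile).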

I was wondering whether there are any configuration options I need to set on the K40 to make it behave like the GTX 770 (set up Hyper-Q, etc.?).
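
For reference, I can check what the runtime reports for copy engines and concurrent kernel support with a quick device-property query like the following (just a sketch, not my actual code; it assumes the K40 is device 0):

[code]
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumption: the K40 is device 0
    printf("%s: asyncEngineCount=%d concurrentKernels=%d\n",
           prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    return 0;
}
[/code]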

thanks

Are the devices in the same host, or not?

No, they aren’t. In fact, the K40 is in an HPC machine (32-core E5-2690 0 @ 2.90GHz, 512GB RAM) shared by other users. Would it be reasonable to assume that the CPU memory bus could be saturated by other processes? However, I do have exclusive access to the K40 itself.

as per njuffa:

“Are you using numactl to control CPU and memory affinity such that each GPU always communicates with the “near” CPU and memory?”

[url]https://devtalk.nvidia.com/default/topic/828002/?comment=4518117[/url]

I am not, thanks for highlighting this. However, I don’t think memory throughput is the issue: the bandwidth of the transfers shown above is 9.9 GB/s.

I presume you are using pinned memory for the device-to-host transfers; are you using pinned memory for the host-to-device transfers too?
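
If the host-to-device source buffers are ordinary malloc/new allocations, cudaMemcpyAsync generally falls back to a staged copy that will not overlap with kernel execution; an existing buffer can be page-locked after the fact, e.g. (illustrative only):

[code]
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t n = 1 << 20;                           // illustrative size
    float *h_in = (float *)malloc(n * sizeof(float));

    // Page-lock the existing allocation so cudaMemcpyAsync from it can
    // actually run asynchronously with respect to the host.
    cudaHostRegister(h_in, n * sizeof(float), cudaHostRegisterDefault);

    // ... enqueue cudaMemcpyAsync / kernels on streams here ...

    cudaHostUnregister(h_in);
    free(h_in);
    return 0;
}
[/code]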

Do you have any statistics on the load on the host and the host memory utilization, given the multiple users you mentioned?