I’ve been developing some CUDA code on an NVIDIA GTX 770. To overlap data transfers with kernel execution, I implemented a finite state machine that schedules these operations on four streams. Using this strategy it was possible to completely overlap the two. See http://imgur.com/DmGm5tG.
I then tried running the same code on a K40, and there seems to be far less overlap on this device. See http://imgur.com/hJKn0js. It almost appears as if the driver “waits” a bit before scheduling the data transfers. I was surprised, because the GK110 architecture has more advanced scheduling features and an extra copy engine.
I haven’t posted any code because it’s not simple to provide a minimal example. The essential pattern is:
- transfer data onto the device
- execute kernels
- transfer a single floating-point value off the device
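To make the pattern concrete, here is a rough sketch of its shape. It is not my actual code: the finite state machine is replaced by a plain round-robin loop, and `reduce_chunk`, `run_pipeline`, the chunk sizes and the one-block launch are all hypothetical stand-ins chosen for illustration. It assumes pinned host memory, which `cudaMemcpyAsync` needs in order to overlap with kernel execution.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bail out of run_pipeline with -1.0f on any CUDA error (sketch only:
// resources are not released on the error path).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Hypothetical stand-in for the real kernels: one block of 256 threads
// sums a chunk of n floats into a single float.
__global__ void reduce_chunk(const float* in, float* out, int n) {
    __shared__ float partial[256];
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) acc += in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = partial[0];
}

// Processes `chunks` chunks of `n` floats over four streams and returns
// the sum of the per-chunk results, or -1.0f if a CUDA call fails.
float run_pipeline(int chunks, int n) {
    const int kStreams = 4;
    const size_t bytes = (size_t)n * sizeof(float);
    cudaStream_t streams[kStreams];
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned host buffers so the async copies can overlap with kernels.
    CHECK(cudaMallocHost(&h_in, chunks * bytes));
    CHECK(cudaMallocHost(&h_out, chunks * sizeof(float)));
    CHECK(cudaMalloc(&d_in, chunks * bytes));
    CHECK(cudaMalloc(&d_out, chunks * sizeof(float)));
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));
    for (size_t i = 0; i < (size_t)chunks * n; ++i) h_in[i] = 1.0f;

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = streams[c % kStreams];
        float* d_chunk = d_in + (size_t)c * n;
        // 1. transfer data onto the device
        CHECK(cudaMemcpyAsync(d_chunk, h_in + (size_t)c * n, bytes,
                              cudaMemcpyHostToDevice, st));
        // 2. execute kernels
        reduce_chunk<<<1, 256, 0, st>>>(d_chunk, d_out + c, n);
        // 3. transfer a single floating-point value off the device
        CHECK(cudaMemcpyAsync(h_out + c, d_out + c, sizeof(float),
                              cudaMemcpyDeviceToHost, st));
    }
    CHECK(cudaDeviceSynchronize());

    float total = 0.0f;
    for (int c = 0; c < chunks; ++c) total += h_out[c];

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return total;
}
```

Calling `run_pipeline(8, 1024)` from `main` issues eight chunks, two per stream. The sketch issues each chunk's copy, kernel and copy-back together in one stream; the real code decides the issue order with the state machine instead.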
I was wondering whether there are any configuration options I need to set on the K40 to make it behave like the GTX 770 (setting up Hyper-Q, etc.)?
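In case it helps with the comparison, here is a small sketch for printing what the runtime reports about copy engines and concurrent kernel support on each device. The function name `report_overlap_capabilities` is made up for illustration; the fields are the standard `cudaDeviceProp` ones. It only reports capabilities and does not change any configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints overlap-related properties for every visible device and returns
// the number of devices reported (0 if the device count query fails).
int report_overlap_capabilities() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("%s (sm_%d%d): asyncEngineCount=%d concurrentKernels=%d\n",
               prop.name, prop.major, prop.minor,
               prop.asyncEngineCount, prop.concurrentKernels);
    }
    return count;
}
```

Running this on both machines would at least confirm whether the K40 is exposing its second copy engine to the application.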
thanks