Hi!
Playing with CUDA, I've reached streams, and now I'm trying to get the most out of the streams concept.
So far I have been able to pack my processing into 3 streams by combining kernel launches and memory operations.
I'm using a GeForce GTX 650, which I believe should have 2 copy engines and so should be able to perform H2D and D2H transfers simultaneously.
Here is what I do: I created 3 streams and allocated 3 blocks of pinned host memory.
Then I run a loop. On every iteration I asynchronously issue:
an H2D copy on stream (i),
the kernel on stream (i-1),
a D2H copy on stream (i-2).
This way every stream cycles through the H2D, kernel, D2H sequence.
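The loop above can be sketched roughly as follows. This is not the poster's actual code (which was not shared); it is a minimal illustration assuming hypothetical names: pinned host buffers `h_in[s]`/`h_out[s]`, device buffers `d_in[s]`/`d_out[s]`, a kernel `work`, and chunk/launch parameters `nChunks`, `bytes`, `grid`, `block`, `n` defined elsewhere.

```cuda
// Sketch of the 3-stream pipeline: on iteration i, issue the H2D copy for
// chunk i, the kernel for chunk i-1, and the D2H copy for chunk i-2, each
// on its own stream. All names here are illustrative assumptions.
const int nStreams = 3;
cudaStream_t stream[nStreams];
for (int s = 0; s < nStreams; ++s)
    cudaStreamCreate(&stream[s]);

// Run nChunks + 2 iterations so the pipeline drains at the end.
for (int i = 0; i < nChunks + 2; ++i) {
    if (i < nChunks)                       // H2D for chunk i
        cudaMemcpyAsync(d_in[i % 3], h_in[i], bytes,
                        cudaMemcpyHostToDevice, stream[i % 3]);
    if (i >= 1 && i - 1 < nChunks)         // kernel for chunk i-1
        work<<<grid, block, 0, stream[(i - 1) % 3]>>>(
            d_in[(i - 1) % 3], d_out[(i - 1) % 3], n);
    if (i >= 2 && i - 2 < nChunks)         // D2H for chunk i-2
        cudaMemcpyAsync(h_out[i - 2], d_out[(i - 2) % 3], bytes,
                        cudaMemcpyDeviceToHost, stream[(i - 2) % 3]);
}
cudaDeviceSynchronize();
```

Within one stream the three operations for a given chunk stay ordered; across streams the copies and kernels are free to overlap as far as the hardware allows.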
I monitored the run with NSight VS Edition. To make everything more visible I made the memory transfers long enough (512 KB) and had the kernel perform lots of local floating-point operations per thread, so the timeline is easier to read.
As far as I can see, the memory operations fill the whole timeline with little to no gap between them, and the kernels execute in perfect parallel with the memory operations.
But the memory operations execute one after another: H2D copies refuse to overlap with D2H copies.
Am I missing something here?
PS: I don't want to post my code here because the details make it a little complicated. On the other hand, it is pretty standard vector-addition example code.
GeForce cards only have 1 asynchronous copy engine, so you will not be able to obtain concurrent H2D and D2H transfers on your device. The exception is very small transfers, which may be implemented through a mechanism other than the copy engine.
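The number of copy engines the driver reports can be checked directly; it is the `asyncEngineCount` field of `cudaDeviceProp`, the same value the `deviceQuery` sample prints. A minimal sketch:

```cuda
// Print the reported number of asynchronous copy engines for device 0.
// asyncEngineCount == 1 means H2D and D2H cannot overlap each other;
// asyncEngineCount == 2 means transfers in both directions can overlap.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}
```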
Interesting observation! Based on my knowledge, I would have agreed with Greg: There is only one DMA engine on consumer GPUs. Now I wonder whether the two copy engines reported for your GTX 980 are a “premium feature” found on high-end consumer GPUs, a new feature on Maxwell-based consumer cards, a bug in the driver, or a bug in the app that reports the capabilities. When the CUDA 7.0 release goes final, I will check the documentation as to what it says about dual DMA engines.
I think the small transfers Greg is referring to are those small host->device transfers that are injected directly into the GPU's command queue and that are therefore independent of any potentially concurrent device->host transfers by a copy engine. That was intended more as a latency optimization (instead of sending a command to the GPU that then turns on the DMA engine to fetch the data, just send the data itself) than as an attempt to improve the concurrency of transfers. I seem to recall a 64 KB size limit for such copies. A microbenchmark could probably pinpoint the exact limit, but I am too lazy right now to write one.
Have you checked the CUDA samples? While I am not aware of one, they may include a test of concurrent transfers. It is not difficult to write one: basically a simple bandwidth test with increasing block sizes and two CUDA streams that can copy simultaneously if the hardware/driver allows it. The difference in execution time compared to a single-stream configuration executing the same transfers should clearly show whether uploads and downloads happen simultaneously or not.
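The test described above can be sketched along these lines (buffer sizes and names are illustrative, not from any CUDA sample): time one H2D and one D2H copy of equal size issued into two different streams. If the elapsed time is roughly that of a single copy, the transfers overlapped; if it is roughly doubled, they were serialized.

```cuda
// Bidirectional transfer microbenchmark sketch: issue an H2D and a D2H copy
// concurrently in two streams and time them with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;       // 64 MB per direction (arbitrary)
    void *h_up, *h_down, *d_up, *d_down;
    cudaMallocHost(&h_up, bytes);        // pinned host memory is required
    cudaMallocHost(&h_down, bytes);      // for truly asynchronous copies
    cudaMalloc(&d_up, bytes);
    cudaMalloc(&d_down, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyAsync(d_up, h_up, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_down, d_down, bytes, cudaMemcpyDeviceToHost, s1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("both directions concurrently: %.2f ms\n", ms);
    return 0;
}
```

For the single-stream baseline, issue both copies into `s0` and compare the elapsed times.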
Which seems to indicate that it can perform a Device to Host copy concurrently with a Host to Device copy if they are in different streams.
Nvvp also shows graphically that those two copies of the same size in opposite directions occur during the same interval of time, which supports the statement that there are two copy engines in the GTX 980.
When I run the same code on the same PC and specify the GTX 780ti GPU the output is different:
I agree, the logs definitely suggest the GTX 980 is overlapping the copies in opposite directions while the GTX 780 Ti is not. That is consistent with the number of copy engines reported for the two cards.
Cool! On the other hand it is weird. I have not come across any official reference to the dual copy engines on the GTX 980, you’d think this would be on a “new and improved” marketing slide somewhere.