Hello,
Are two or more calls to cudaMemcpyAsync(cudaMemcpyHostToDevice) concurrent?
On just about all hardware, no. Only kernel execution and copying can be overlapped. The new Fermi based Tesla C2050 has 2 PCI-e DMA engines and might well be able to overlap copies, although I haven’t tested it yet to see whether it works.
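To illustrate the one kind of overlap that does work on most hardware, here is a minimal sketch (names like myKernel, d_a, d_b, and N are placeholders, not from this thread) of a copy in one stream overlapping a kernel in another:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    float *h_a, *d_a, *d_b;

    // Host memory must be page-locked (pinned) for cudaMemcpyAsync
    // to be truly asynchronous with respect to the host.
    cudaMallocHost(&h_a, N * sizeof(float));
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The copy in s0 and the kernel in s1 can overlap on any part
    // with at least one copy engine.
    cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, s0);
    myKernel<<<(N + 255) / 256, 256, 0, s1>>>(d_b, N);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```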
Thank you!
I use a GTX480.
1. But are cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost copies concurrent on Fermi?
2. Is the order important?
cudaMemcpyAsync(cudaMemcpyHostToDevice)
cudaMemcpyAsync(cudaMemcpyDeviceToHost)
or
cudaMemcpyAsync(cudaMemcpyDeviceToHost)
cudaMemcpyAsync(cudaMemcpyHostToDevice)
3. Can I use cudaMemcpy for concurrent copies, or only cudaMemcpyAsync with different streams?
On Fermi based Tesla cards, I think so. On other Fermi based cards, no.
I don’t know the answer to that, but I suspect no.
I believe all concurrent/asynchronous memory operations only work via the streams interface.
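As a sketch of what "via the streams interface" means in practice: the two transfers must be issued with cudaMemcpyAsync from pinned host memory into two different streams, or they will serialize regardless of the hardware. All buffer and stream names below are placeholders:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned host memory is required for asynchronous transfers.
    cudaMallocHost(&h_in, bytes);
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // In the same stream these would serialize; in different streams the
    // driver may schedule them on separate DMA engines, if the card has two.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Whether the two copies actually overlap then depends only on how many copy engines the card exposes.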
Is that not valid for the GTX480?
Some Fermi features are downgraded on the GeForce cards:
http://forums.nvidia.com/index.php?showtopic=165055
According to this, only 1 DMA engine is enabled on the GeForce cards. The programming guide should be modified to make this clear.
As I said in previous posts, I don’t believe it is valid for the consumer Fermi cards, only Tesla.
I have been seeing strange numbers, where the GTX480 can overlap H2D and D2H while the C2050 can't
(GTX280 and GTX480 are in the same machine, C2050 is in a different one)
$ ./concur_bandwidth 0
device 0: GeForce GTX 480
Device 0 took 3000.489502 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594
Device 0 took 3006.603027 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 6621.408203
Device 0 took 2995.593994 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 11184.810547
$ ./concur_bandwidth 1
device 1: GeForce GTX 280
Device 1 took 2999.640137 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594
Device 1 took 3000.135498 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 5860.841309
Device 1 took 2978.960693 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 5905.580078
$ ./concur_bandwidth 0
device 0: Tesla C2050
Device 0 took 3006.502441 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 6129.276855
Device 0 took 2990.946533 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 5681.883789
Device 0 took 2988.590332 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 6889.844238
The test is the slightly modified concurrent test from
Thank you!!!
Can you please post the complete code so I can test it on my computer? I cannot get D2H to overlap with H2D on a GTS450.
Thanks