Concurrent Data Transfers

Hello,
are two or more calls to cudaMemcpyAsync with cudaMemcpyHostToDevice concurrent with each other?

On just about all current hardware, no. Only kernel execution and copying can be overlapped. The new Fermi-based Tesla C2050 has two PCIe DMA engines and might well be able to overlap copies, although I haven’t tested it yet to see whether it works.

Thank you!

I use GTX480.

1. But can a cudaMemcpyHostToDevice copy and a cudaMemcpyDeviceToHost copy run concurrently on Fermi?

2. Does the order of the calls matter?

cudaMemcpyAsync(cudaMemcpyHostToDevice)

cudaMemcpyAsync(cudaMemcpyDeviceToHost)

or

cudaMemcpyAsync(cudaMemcpyDeviceToHost)

cudaMemcpyAsync(cudaMemcpyHostToDevice)

3. Can I use cudaMemcpy for concurrent copies, or only cudaMemcpyAsync with different streams?

On Fermi-based Tesla cards, I think so. On other Fermi-based cards, no.

I don’t know the answer to that, but I suspect no.

I believe all concurrent/asynchronous memory operations only work via the streams interface.
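For reference, asynchronous copies via the streams interface also require page-locked (pinned) host memory; otherwise cudaMemcpyAsync falls back to synchronous behavior. A minimal sketch of issuing the two directions in separate streams (buffer size and stream layout are my own choices, not from this thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;  // 32 MB per direction (arbitrary)
    float *h_in, *h_out, *d_in, *d_out;

    // cudaMemcpyAsync is only truly asynchronous with pinned host memory.
    cudaHostAlloc(&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Issue the two directions in different streams so the driver may
    // overlap them on hardware with two copy engines.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s2);

    cudaDeviceSynchronize();
    printf("copies issued and completed\n");

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

Whether the two copies actually overlap then depends on the hardware having both DMA engines enabled.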

Is it not valid for the GTX480?

Some Fermi features are downgraded on the GeForce cards:

http://forums.nvidia.com/index.php?showtopic=165055

According to this, only one DMA engine is enabled on the GeForce cards. The programming guide should be modified to make this clear.

As I said in previous posts, I don’t believe it is valid for the consumer Fermi cards, only Tesla.
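One way to see what a given card reports is to query its device properties. In newer CUDA versions the asyncEngineCount field (older releases only expose the boolean deviceOverlap) should read 2 on cards with both DMA engines enabled; a small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // asyncEngineCount: 0 = no copy/kernel overlap, 1 = one copy
        // engine (copies overlap kernels), 2 = H2D and D2H can also
        // overlap each other.
        printf("device %d: %s, asyncEngineCount = %d\n",
               dev, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```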

I have been seeing strange numbers, where the GTX480 can overlap H2D and D2H while the C2050 can’t.

(The GTX280 and GTX480 are in the same machine; the C2050 is in a different one.)

$ ./concur_bandwidth  0

device 0: GeForce GTX 480

Device 0 took 3000.489502 ms

Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594

Device 0 took 3006.603027 ms

Test 2: Aggregate DtoH bandwidth in MB/s: 6621.408203

Device 0 took 2995.593994 ms

Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 11184.810547

$ ./concur_bandwidth  1

device 1: GeForce GTX 280

Device 1 took 2999.640137 ms

Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594

Device 1 took 3000.135498 ms

Test 2: Aggregate DtoH bandwidth in MB/s: 5860.841309

Device 1 took 2978.960693 ms

Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 5905.580078

$ ./concur_bandwidth 0

device 0: Tesla C2050

Device 0 took 3006.502441 ms

Test 1: Aggregate HtoD bandwidth in MB/s: 6129.276855

Device 0 took 2990.946533 ms

Test 2: Aggregate DtoH bandwidth in MB/s: 5681.883789

Device 0 took 2988.590332 ms

Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 6889.844238

The test is a slightly modified version of the concurrent bandwidth test from

http://forums.nvidia.com/index.php?showtopic=86536
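Since the linked test isn’t reproduced here, this is a rough sketch of how such a bidirectional measurement can be structured, using event-based timing (buffer size and repeat count are placeholders, not the values behind the numbers above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // placeholder transfer size
    const int reps = 20;             // placeholder repeat count
    float *h_a, *h_b, *d_a, *d_b;
    cudaHostAlloc(&h_a, bytes, cudaHostAllocDefault);  // pinned host buffers
    cudaHostAlloc(&h_b, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i) {
        // Queue both directions at once; with two copy engines these
        // can run simultaneously.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, up);
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, down);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Aggregate both directions; with full overlap this approaches
    // twice the one-way bandwidth.
    double mb = 2.0 * reps * (double)bytes / (1024.0 * 1024.0);
    printf("aggregate bidirectional bandwidth: %.1f MB/s\n",
           mb / (ms / 1000.0));
    return 0;
}
```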

Thank you!!!

Can you please post the complete code so I can test it on my machine? I cannot get D2H to overlap with H2D on a GTS450.
Thanks