Concurrent Data Transfers
Hello,

Are two or more calls to cudaMemcpyAsync(cudaMemcpyHostToDevice) concurrent?

#1
Posted 06/13/2010 10:42 AM   
On just about all hardware, no. Only kernel execution and copying can be overlapped, not two copies with each other. The new Fermi-based Tesla C2050 has two PCI-e DMA engines and might well be able to overlap copies, although I haven't tested it yet to see whether it works.
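To make the supported overlap concrete, here is a minimal sketch of my own (the kernel, names and sizes are made up, not from this thread): a kernel launched in one stream overlapping with an asynchronous host-to-device copy issued in another stream, which requires page-locked host memory.

[code]
// Sketch: overlap a kernel in stream s1 with a host-to-device copy in stream s2.
// The copy is only truly asynchronous if the host buffer is page-locked (pinned).
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);   // pinned host buffer
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams, different buffers: the kernel and the copy can overlap.
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    return 0;
}
[/code]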

#2
Posted 06/13/2010 10:55 AM   
Thank you!
I use a GTX480.

1. But can a cudaMemcpyHostToDevice copy and a cudaMemcpyDeviceToHost copy run concurrently on Fermi?

2. Is the order important?
[code]cudaMemcpyAsync(cudaMemcpyHostToDevice)
cudaMemcpyAsync(cudaMemcpyDeviceToHost)

or

cudaMemcpyAsync(cudaMemcpyDeviceToHost)
cudaMemcpyAsync(cudaMemcpyHostToDevice)[/code]

3. Can I use cudaMemcpy for concurrent copies, or only cudaMemcpyAsync with different streams?

#3
Posted 06/13/2010 11:09 AM   
[quote name='Deus' post='1072729' date='Jun 13 2010, 02:09 PM']1. But can a cudaMemcpyHostToDevice copy and a cudaMemcpyDeviceToHost copy run concurrently on Fermi?[/quote]
On Fermi-based Tesla cards, I think so. On other Fermi-based cards, no.

[quote]2. Is the order important?[/quote]
I don't know the answer to that, but I suspect not.

[quote]3. Can I use cudaMemcpy for concurrent copies, or only cudaMemcpyAsync with different streams?[/quote]
I believe all concurrent/asynchronous memory operations only work via the streams interface.
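To make that concrete, here is a small fragment of my own showing how the two copies would be issued through the streams interface (buffer names and sizes are made up): both host buffers must be page-locked and the copies must go into different non-default streams, otherwise the calls serialize.

[code]
// Sketch: H2D in one stream, D2H in another, so hardware with two copy
// engines can overlap them. Both host buffers must be pinned (page-locked).
const size_t bytes = 32 << 20;   // 32 MB, arbitrary

float *h_in, *h_out, *d_in, *d_out;
cudaMallocHost(&h_in, bytes);    // pinned host memory
cudaMallocHost(&h_out, bytes);
cudaMalloc(&d_in, bytes);
cudaMalloc(&d_out, bytes);

cudaStream_t up, down;
cudaStreamCreate(&up);
cudaStreamCreate(&down);

cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, up);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);

cudaStreamSynchronize(up);
cudaStreamSynchronize(down);
[/code]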

#4
Posted 06/13/2010 12:58 PM   
[quote]NVIDIA CUDA Programming Guide 3.0
3.2.6.4 Concurrent Data Transfers
Devices of compute capability 2.0 can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory.[/quote]
Is this not valid for the GTX480?

#5
Posted 06/13/2010 06:55 PM   
[quote name='Deus' post='1072857' date='Jun 13 2010, 12:55 PM']Is this not valid for the GTX480?[/quote]

Some Fermi features are downgraded on the GeForce cards:

[url="http://forums.nvidia.com/index.php?showtopic=165055"]http://forums.nvidia.com/index.php?showtopic=165055[/url]

According to this, only 1 DMA engine is enabled on the GeForce cards. The programming guide should be modified to make this clear.
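If you want to check this from code rather than rely on forum posts, a quick sketch of a runtime query (note: asyncEngineCount was only added to cudaDeviceProp in CUDA 4.0; older toolkits only expose the coarser deviceOverlap flag):

[code]
// Sketch: report whether the device can overlap copies with kernels, and
// how many copy (DMA) engines it exposes.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: deviceOverlap = %d\n", prop.name, prop.deviceOverlap);
#if CUDART_VERSION >= 4000
    // 0 = no overlap, 1 = one copy engine, 2 = H2D and D2H can overlap
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
#endif
    return 0;
}
[/code]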

#6
Posted 06/13/2010 07:29 PM   
[quote name='Deus' post='1072857' date='Jun 13 2010, 09:55 PM']Is this not valid for the GTX480?[/quote]
As I said in previous posts, I don't believe it is valid for the consumer Fermi cards, only Tesla.

#7
Posted 06/13/2010 07:45 PM   
[quote name='avidday' post='1072889' date='Jun 13 2010, 02:45 PM']As I said in previous posts, I don't believe it is valid for the consumer Fermi cards, only Tesla.[/quote]
I have been seeing strange numbers, where the GTX480 can overlap H2D and D2H while the C2050 can't
(the GTX280 and GTX480 are in the same machine, the C2050 is in a different one)

[code]
$ ./concur_bandwidth 0
device 0: GeForce GTX 480
Device 0 took 3000.489502 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594
Device 0 took 3006.603027 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 6621.408203
Device 0 took 2995.593994 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 11184.810547


$ ./concur_bandwidth 1
device 1: GeForce GTX 280
Device 1 took 2999.640137 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 5995.058594
Device 1 took 3000.135498 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 5860.841309
Device 1 took 2978.960693 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 5905.580078


$ ./concur_bandwidth 0
device 0: Tesla C2050
Device 0 took 3006.502441 ms
Test 1: Aggregate HtoD bandwidth in MB/s: 6129.276855
Device 0 took 2990.946533 ms
Test 2: Aggregate DtoH bandwidth in MB/s: 5681.883789
Device 0 took 2988.590332 ms
Test 3: Aggregate bidirectional per GPU bandwidth in MB/s: 6889.844238[/code]

The test is a slightly modified version of the concurrent bandwidth test from
[url="http://forums.nvidia.com/index.php?showtopic=86536"]http://forums.nvidia.com/index.php?showtopic=86536[/url]



#8
Posted 06/14/2010 03:31 PM   
Thank you!!!

#9
Posted 06/14/2010 07:50 PM   
Can you please post the complete code so I can test it on my computer? I cannot get D2H to overlap with H2D on a GTS 450.
Thanks

#10
Posted 04/27/2012 02:37 PM   