Memory copy improvement?
Hello,

I'm currently working on real-time image-processing algorithms.
I have tried a few simple algorithms (such as histogram equalization), but I've run into a huge bottleneck for my real-time application: the memory copy.
I process 1280*1024-pixel images, and the memory copy using cudaMemcpy takes far too long (about 1 to 2 ms for an array of this size). I tried using cudaHostAlloc and cudaHostGetDevicePointer to optimize the data flow, but it changed essentially nothing: instead of a long memory copy, I now get longer kernel executions (maybe that comes from the kernel itself).
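
For reference, here is roughly what my zero-copy attempt looks like (a simplified sketch only - error checking and my real kernel are left out, and I'm assuming 8-bit grayscale pixels here):

[code]
// Simplified sketch of the cudaHostAlloc + cudaHostGetDevicePointer (zero-copy) path.
// The kernel reads the mapped host buffer directly over PCIe on every access.
#include <cuda_runtime.h>
#include <cstring>

__global__ void invertKernel(unsigned char *img, int n)   // placeholder for my real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = 255 - img[i];                             // dummy per-pixel operation
}

int main()
{
    const int width = 1280, height = 1024, n = width * height;
    const size_t bytes = (size_t)n * sizeof(unsigned char); // assuming 8-bit grayscale

    cudaSetDeviceFlags(cudaDeviceMapHost);                  // allow mapped host memory

    unsigned char *hostImg = NULL, *devPtr = NULL;
    cudaHostAlloc((void **)&hostImg, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostImg, 0); // device alias of the host buffer

    memset(hostImg, 0, bytes);                              // stand-in for a captured frame

    invertKernel<<<(n + 255) / 256, 256>>>(devPtr, n);      // no cudaMemcpy at all
    cudaDeviceSynchronize();

    cudaFreeHost(hostImg);
    return 0;
}
[/code]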

What is the most efficient way to copy memory from host to device (in terms of processing speed)?
Is there any way to work quickly with host memory?

Thank you in advance for your answers.
#1
Posted 04/19/2012 01:15 PM   
Are you using pinned memory (cudaMallocHost)?

You can measure how long it takes to transfer the array; if you get much less than 6 GB/s, something is wrong. Otherwise there is not much that can be done, unless transferring only part of the data is acceptable. You could also consider getting a Kepler card - it has PCIe 3.0, which should be about 2x faster.
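
For example, a rough sketch along these lines (no error checking; I'm just assuming an 8-bit 1280x1024 image like yours) will print the effective host-to-device bandwidth with pinned memory:

[code]
// Rough sketch: time host-to-device copies of one image with CUDA events
// and report the effective bandwidth. Uses pinned host memory (cudaMallocHost).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 1280 * 1024;                      // 8-bit 1280x1024 image (assumed)
    const int reps = 20;                                   // average over several copies

    unsigned char *hostBuf = NULL, *devBuf = NULL;
    cudaMallocHost((void **)&hostBuf, bytes);              // pinned host allocation
    cudaMalloc((void **)&devBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3f ms per copy, %.2f GB/s\n",
           ms / reps, (double)bytes * reps / (ms * 1.0e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
[/code]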
#2
Posted 04/19/2012 02:53 PM   
Hello,

I tried using pinned memory, but the time I gain on the memory copy is lost in the kernel (maybe because of the way it processes the data).
For example, I tried this with my histogram kernel and with the NPP "Not" arithmetic function kernel. I measured the processing time, and it shows a huge slowdown for the NPP kernel (from 100 us to 30 ms...) and a small slowdown for the histogram kernel (from 6 ms to 7 ms).

I benchmarked the time spent transferring the array, and here is what I get:
[Attached image: memcopy.jpg - memory-copy bandwidth graph]


The bandwidth test program provided in the SDK gives me 1.4 GB/s (which is indeed very slow...).

I'm using a Tesla S1070.

Thank you for your answer.
#3
Posted 04/20/2012 06:34 AM   
Uh, I see. cudaHostAlloc looks like a new name for cudaMallocHost.

The S1070 should have PCIe 2.0, so I'd expect 6 GB/s, or at least over 5. Some specs, however, mention that x8 is also possible, which is half the speed - http://www.nvidia.co.uk/object/tesla_s1070_uk.html

The graph shows ~1.1 GB/s, about the same as your 1.4 GB/s figure. Could you run that SDK program in pinned mode, i.e. with the --memory=pinned option?

I don't entirely understand how pinned memory can slow down the kernel - as long as the data is in GPU memory, it should not matter... Maybe this is a measurement problem; for instance, do you use asynchronous transfers?
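
In case it helps, an asynchronous transfer (cudaMemcpyAsync on a stream, which needs pinned host memory to really be asynchronous) would look roughly like this sketch - processFrame is just a placeholder name for your kernel:

[code]
// Rough sketch of asynchronous transfers on a CUDA stream.
// Pinned host memory is required; otherwise cudaMemcpyAsync falls back to a blocking copy.
#include <cuda_runtime.h>

__global__ void processFrame(unsigned char *img, int n)   // placeholder for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        img[i] = 255 - img[i];                             // dummy per-pixel work
}

int main()
{
    const int n = 1280 * 1024;                             // 8-bit 1280x1024 image (assumed)
    const size_t bytes = n;

    unsigned char *hostImg = NULL, *devImg = NULL;
    cudaHostAlloc((void **)&hostImg, bytes, cudaHostAllocDefault); // pinned
    cudaMalloc((void **)&devImg, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are issued on the same stream: they run in order on
    // the device, but the host is free to do other work until the final sync.
    cudaMemcpyAsync(devImg, hostImg, bytes, cudaMemcpyHostToDevice, stream);
    processFrame<<<(n + 255) / 256, 256, 0, stream>>>(devImg, n);
    cudaMemcpyAsync(hostImg, devImg, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(devImg);
    cudaFreeHost(hostImg);
    return 0;
}
[/code]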
#4
Posted 04/20/2012 07:20 AM   
Using the --memory=pinned option, I get the same results (1.5 GB/s)...

By asynchronous transfers, you mean cudaMemcpyAsync? If so, no, I don't use them.
#5
Posted 04/20/2012 07:59 AM   
Maybe the problem is with the host mainboard (slot operating in x8 or PCIe 1.1 mode).
#6
Posted 04/20/2012 03:39 PM   
What is your motherboard?
On both X48 and P45 chipsets I have seen roughly 5.2 GB/s pinned and 3 to 3.2 GB/s pageable. 1.4 is unbelievably slow.
#7
Posted 04/25/2012 08:19 PM   