Large overhead on cudaMemcpy, isolated case
Hello,

I have written a small CUDA program that performs a Discrete Cosine Transform (DCT) on image frames of varying sizes, and I see an interesting result that I am unable to explain.

My testbed consists of two separate computers:
- A Dell XPS with a GeForce 8400M GS GPU.
- An HP Compaq with a GTX280 and an NVS 295. The NVS card renders the display.

Please find attached my test results for DCTs of varying matrix sizes, profile_detailed.pdf. The plot shows the individual steps taken to perform the transformation on different GPUs and CPUs. Notice that, for the GPUs that render the display (the 8400M GS and the NVS 295), the memory copy overhead is very large compared to the card that isn't used for display rendering (the GTX 280).

A quick look at the raw test data shows that, in both cases, the problem is the memory copy from device to host (the output has been trimmed to keep the post short):

TEST START
Running on GeForce 8400M GS
-> Performing DCT:width=512,height=512
cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice), time=0.000708327978
cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost), time=0.174465611577
TEST END

TEST START
Running on Quadro NVS 295
-> Performing DCT:width=512,height=512
cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice), time=0.000250003010
cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost), time=0.208029866219
TEST END

For the transfer size in question (512 × 512 × 4 bytes ≈ 1 MB), this corresponds to a bandwidth of about 5 MB/s for device-to-host transfers. This directly contradicts a simple run of the bandwidthTest example from the SDK:

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)

992000 2459.8
996000 2467.2
1000000 2479.7
1004000 2538.4
1008000 2543.8

So, the question is why this kind of bandwidth is so much lower on the GPU that runs the display. Another observation I have made is that the GPU that renders the display has a limited memory pool of around 30 - 45 MB. The GPU DCT test looks like this:

RUNTEST(cudaMalloc((void**)&cosTable, sizeof(float) * 8 * 8 * 8 * 8))
RUNTEST(cudaMalloc((void**)&qntTable, sizeof(float) * 8 * 8))
RUNTEST(cudaMalloc((void**)&data_d, sizeof(float) * width * height))
RUNTEST(cudaMalloc((void**)&odata_d, sizeof(float) * width * height))

RUNTESTKERN(gpuCalcCosineTable<<<cos_blockDim, cos_threadDim>>>(cosTable))
RUNTESTKERN(gpuCalcNormTable<<<qnt_blockDim, qnt_threadDim>>>(qntTable))

RUNTEST(cudaMemcpy(data_d, data, width * height * sizeof(float), cudaMemcpyHostToDevice))

RUNTESTKERN(gpuSubtract<<<mNrm_blockDim, mNrm_threadDim>>>(data_d, width, height))
RUNTESTKERN(gpuInnerLoop<<<dctI_blockDim, dctI_threadDim>>>(cosTable, data_d, odata_d, width, height))

RUNTESTKERN(gpuNormalize<<<dctN_blockDim, dctN_threadDim>>>(odata_d, qntTable, width))

RUNTEST(cudaMemcpy(out, odata_d, width * height * sizeof(float), cudaMemcpyDeviceToHost))

RUNTEST(cudaFree(cosTable))
RUNTEST(cudaFree(qntTable))
RUNTEST(cudaFree(data_d))
RUNTEST(cudaFree(odata_d))
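The RUNTEST/RUNTESTKERN macros themselves are not shown in the post; a minimal sketch consistent with the timing output above, assuming a host-side wall-clock timer around each call with no synchronization (this implementation is my guess, not the actual macros), might look like:

```cuda
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wallTime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Hypothetical RUNTEST: time a runtime call with a host timer and
// abort on error. Note there is no synchronization around the call.
#define RUNTEST(call)                                              \
    do {                                                           \
        double t0 = wallTime();                                    \
        cudaError_t err = (call);                                  \
        double t1 = wallTime();                                    \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s: %s\n", #call,                     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
        printf("%s, time=%.12f\n", #call, t1 - t0);                \
    } while (0)

// Hypothetical RUNTESTKERN: a kernel launch returns no status, so
// check cudaGetLastError() after the (asynchronous) launch.
#define RUNTESTKERN(launch)                                        \
    do {                                                           \
        double t0 = wallTime();                                    \
        launch;                                                    \
        double t1 = wallTime();                                    \
        cudaError_t err = cudaGetLastError();                      \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s: %s\n", #launch,                   \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
        printf("%s, time=%.12f\n", #launch, t1 - t0);              \
    } while (0)
```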
Attachments

profile_detailed.pdf

#1
Posted 04/18/2012 09:47 AM   
The device to host transfer time includes the time for waiting on the calculation to finish. If you want to measure transfer time alone, insert a cudaDeviceSynchronize() call before timing the device to host transfer.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
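As a concrete illustration of the synchronization suggestion above (a sketch only; CUDA events are used here instead of host timers, and the function name is made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: measuring the device-to-host copy in isolation.
// Without the initial synchronize, the copy's measured time would also
// include all still-pending kernel work, because kernel launches are
// asynchronous and cudaMemcpy blocks until they finish.
void timedCopyBack(float *out, const float *odata_d, int width, int height)
{
    cudaDeviceSynchronize();              // drain pending kernels first

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(out, odata_d, width * height * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device-to-host copy: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```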

#2
Posted 04/18/2012 10:18 AM   
Nice. I didn't know that the execution configuration syntax returned immediately.

#3
Posted 04/18/2012 12:12 PM   