About concurrent execution (overlap of data transfer and kernel execution)

Hi,
I want to use concurrent execution to calculate C = A + B.
Here A is calculated on the GPU, B for some reason can only be calculated on the CPU, and C = A + B is again computed on the GPU.
My code looks like this:
Allocate the memory for B using cudaHostAlloc() with cudaHostAllocMapped;
Get time-stamp1;
A <<< >>> ( … );
B;
cudaDeviceSynchronize();
Get time-stamp2;
C = A + B;
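
Spelled out, a compilable version of what I am doing looks roughly like this (the kernels and the CPU routine are trivial stand-ins for my real ones):

#include <cuda_runtime.h>

__global__ void computeA(float *a, int n)       // trivial stand-in for my real A kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sqrtf((float)i);
}

__global__ void addKernel(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void computeB(float *b, int n)                  // trivial stand-in for my real CPU work
{
    for (int i = 0; i < n; ++i) b[i] = 2.0f * i;
}

int main()
{
    const int N = 1 << 20;
    float *h_B = 0, *d_A = 0, *d_B = 0, *d_C = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);      // must come before any other CUDA call
    cudaHostAlloc((void**)&h_B, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_B, h_B, 0);  // device view of the mapped buffer
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));

    // get time-stamp1 here
    computeA<<<(N + 255) / 256, 256>>>(d_A, N); // calculate A on the GPU
    computeB(h_B, N);                           // calculate B on the CPU meanwhile
    cudaDeviceSynchronize();
    // get time-stamp2 here
    addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);  // C = A + B on the GPU
    cudaDeviceSynchronize();

    cudaFreeHost(h_B);
    cudaFree(d_A);
    cudaFree(d_C);
    return 0;
}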
I found that the time difference between time-stamp1 and time-stamp2 is the sum of the execution times of A and B. Am I misunderstanding how concurrent execution works?

I have checked the asyncAPI sample, but it does not seem to help.
The video card is a GTX 560 Ti, and the development environment is VS2008 + CUDA 5.0.

Insert

cudaStreamQuery(0);

after the kernel launch and before calculating B.

The WDDM driver tries to batch kernel launches to make up for the larger launch overhead on Windows. In your case this causes the driver to hold back the kernel launch until the cudaDeviceSynchronize(), whereas a cudaStreamQuery(0); triggers an immediate launch.
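
With the placeholder names from your sketch, the relevant part becomes:

computeA<<<(N + 255) / 256, 256>>>(d_A, N);  // launch gets queued in the WDDM batch
cudaStreamQuery(0);                          // querying stream 0 flushes the batch,
                                             // so the kernel really starts running now
computeB(h_B, N);                            // CPU work overlaps with the running kernel
cudaDeviceSynchronize();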

Hi tera,

Many thanks! It works fine now!
I have another question: should I instead allocate the memory for B with cudaHostAllocDefault (rather than cudaHostAllocMapped as I did before) and copy the results of B from the CPU to the GPU manually? In that case, can the calculation and the memcpy (from CPU to GPU) of B overlap with the calculation of A? In detail, the code would now look like:
Allocate the memory for B using cudaHostAlloc() with cudaHostAllocDefault;
A <<< >>> ( … );
B;
Use cudaMemcpy to copy the results of B from CPU to GPU;
cudaDeviceSynchronize();
C = A + B;
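
Or, spelled out with the same placeholder names as before:

cudaHostAlloc((void**)&h_B, N * sizeof(float), cudaHostAllocDefault);  // pinned, but not mapped
cudaMalloc((void**)&d_B, N * sizeof(float));                 // B now needs its own device buffer

computeA<<<(N + 255) / 256, 256>>>(d_A, N);                  // GPU calculates A
cudaStreamQuery(0);                                          // flush the WDDM launch batch
computeB(h_B, N);                                            // CPU calculates B meanwhile
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);  // copy B's results to the GPU
cudaDeviceSynchronize();
addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);       // C = A + B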

I have now met two WDDM issues (the other one is the "stopped responding" issue); is there a summary or list of them somewhere?

Regards,
Wu.dy

Hi,
It seems that no matter which method I use, only the calculation of B (on the CPU) overlaps with the calculation of A (on the GPU); the transfer of B's results from CPU to GPU only happens after A has finished.
For example, with mapped page-locked host memory, judging from the timings the transfer seems to happen while "C = A + B;" is being computed (on the GPU). Is that true? Or did I miss something?
Thanks in advance!
Wu.dy

With mapped memory the transfer is indeed performed exactly when the data is needed on the GPU (i.e. during the summation).

However, by using a standard cudaMemcpy() after computing B, you should be able to overlap the kernel calculating A with the copy operation.

To overlap you need two streams: one for the computation (in your case A), the other for the copy (B).
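
A minimal sketch of that, reusing the placeholder names from the earlier posts (note the copy has to be a cudaMemcpyAsync() from page-locked memory; a plain cudaMemcpy() in the default stream serializes with the kernel, which is why you saw no overlap):

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

computeA<<<(N + 255) / 256, 256, 0, streamA>>>(d_A, N);  // A runs in its own stream
cudaStreamQuery(streamA);                                // flush the WDDM launch batch
computeB(h_B, N);                                        // CPU calculates B meanwhile
cudaMemcpyAsync(d_B, h_B, N * sizeof(float),             // async copy in the second stream
                cudaMemcpyHostToDevice, streamB);        // overlaps the still-running kernel
cudaDeviceSynchronize();                                 // wait for both streams to finish
addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);   // C = A + B

cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);

Your GTX 560 Ti has one copy engine, so a single host-to-device copy can run concurrently with kernel execution.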