About concurrent execution (overlap of data transfer and kernel execution)

Hi,
I want to use concurrent execution to calculate C = A + B.
Here A is calculated on the GPU, B for some reason can only be calculated on the CPU, and C = A + B is again computed on the GPU.
My code looks like this:
Allocate the memory for B using cudaHostAlloc() with cudaHostAllocMapped;
Get time-stamp1;
A <<< >>> ( … );
B;
cudaDeviceSynchronize();
Get time-stamp2;
C = A + B;
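
Spelled out, a compilable version of what I am doing looks roughly like this (the kernels and the CPU routine are trivial stand-ins for my real ones):

#include <cuda_runtime.h>

__global__ void computeA(float *a, int n)       // trivial stand-in for my real A kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sqrtf((float)i);
}

__global__ void addKernel(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void computeB(float *b, int n)                  // trivial stand-in for my real CPU work
{
    for (int i = 0; i < n; ++i) b[i] = 2.0f * i;
}

int main()
{
    const int N = 1 << 20;
    float *h_B = 0, *d_A = 0, *d_B = 0, *d_C = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);      // must come before any other CUDA call
    cudaHostAlloc((void**)&h_B, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_B, h_B, 0);  // device view of the mapped buffer
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));

    // get time-stamp1 here
    computeA<<<(N + 255) / 256, 256>>>(d_A, N); // calculate A on the GPU
    computeB(h_B, N);                           // calculate B on the CPU meanwhile
    cudaDeviceSynchronize();
    // get time-stamp2 here
    addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);  // C = A + B on the GPU
    cudaDeviceSynchronize();

    cudaFreeHost(h_B);
    cudaFree(d_A);
    cudaFree(d_C);
    return 0;
}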
I found that the time difference between time-stamp1 and time-stamp2 is the sum of the execution times of A and B. Am I misunderstanding how concurrent execution works?

I have checked the asyncAPI sample, but it does not seem to help.
The video card is a GTX 560 Ti, and the development environment is VS2008 + CUDA 5.0.

Insert

cudaStreamQuery(0);

after the kernel launch and before calculating B.

The WDDM driver tries to batch kernel launches to make up for the larger launch overhead on Windows. In your case this causes the driver to hold back the kernel launch until the cudaDeviceSynchronize(), whereas a cudaStreamQuery(0); triggers an immediate launch.
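
With the placeholder names from your sketch, the relevant part becomes:

computeA<<<(N + 255) / 256, 256>>>(d_A, N);  // launch gets queued in the WDDM batch
cudaStreamQuery(0);                          // querying stream 0 flushes the batch,
                                             // so the kernel really starts running now
computeB(h_B, N);                            // CPU work overlaps with the running kernel
cudaDeviceSynchronize();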

Hi tera,

Many thanks! It works fine now!
I have another question: should I instead allocate the memory for B with cudaHostAllocDefault (rather than cudaHostAllocMapped as I did before) and copy the results of B from the CPU to the GPU manually? In that case, can the calculation and the memcpy (from CPU to GPU) of B overlap with the calculation of A? In detail, the code would now look like:
Allocate the memory for B using cudaHostAlloc() with cudaHostAllocDefault;
A <<< >>> ( … );
B;
Use cudaMemcpy to copy the results of B from CPU to GPU;
cudaDeviceSynchronize();
C = A + B;
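
Or, spelled out with the same placeholder names as before:

cudaHostAlloc((void**)&h_B, N * sizeof(float), cudaHostAllocDefault);  // pinned, but not mapped
cudaMalloc((void**)&d_B, N * sizeof(float));                 // B now needs its own device buffer

computeA<<<(N + 255) / 256, 256>>>(d_A, N);                  // GPU calculates A
cudaStreamQuery(0);                                          // flush the WDDM launch batch
computeB(h_B, N);                                            // CPU calculates B meanwhile
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);  // copy B's results to the GPU
cudaDeviceSynchronize();
addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);       // C = A + B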

I have now met two WDDM issues (the other one is the "stopped responding" issue); is there a summary or list of them somewhere?

Regards,
Wu.dy

Hi,
It seems that no matter which method I use, only the calculation of B (on the CPU) overlaps with the calculation of A (on the GPU); the transfer of B's results from CPU to GPU only happens after A has finished.
For example, with mapped page-locked host memory, judging from the timings the transfer seems to happen while "C = A + B;" is being computed (on the GPU). Is that true? Or did I miss something?
Thanks in advance!
Wu.dy

With mapped memory the transfer is indeed performed exactly when the data is needed on the GPU (i.e. during the summation).

However, by using a standard cudaMemcpy() after computing B, you should be able to overlap the kernel calculating A with the copy operation.

To overlap you need two streams: one for the computation (in your case A), the other for the copy (B).
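
A minimal sketch of that, reusing the placeholder names from the earlier posts (note the copy has to be a cudaMemcpyAsync() from page-locked memory; a plain cudaMemcpy() in the default stream serializes with the kernel, which is why you saw no overlap):

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

computeA<<<(N + 255) / 256, 256, 0, streamA>>>(d_A, N);  // A runs in its own stream
cudaStreamQuery(streamA);                                // flush the WDDM launch batch
computeB(h_B, N);                                        // CPU calculates B meanwhile
cudaMemcpyAsync(d_B, h_B, N * sizeof(float),             // async copy in the second stream
                cudaMemcpyHostToDevice, streamB);        // overlaps the still-running kernel
cudaDeviceSynchronize();                                 // wait for both streams to finish
addKernel<<<(N + 255) / 256, 256>>>(d_C, d_A, d_B, N);   // C = A + B

cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);

Your GTX 560 Ti has one copy engine, so a single host-to-device copy can run concurrently with kernel execution.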