Concurrent Kernels

We’ve just solved a concurrency problem in another thread on this forum.

It was the work queues of the WDDM drivers in Windows Vista, 7 and 8 that caused kernel launches to be delayed. There we inserted a cudaStreamQuery(streams) right after the kernel launches to force the work queue to be flushed and the kernels to begin executing immediately. Maybe try that here?

Christian


Hello Christian,

Thank you for your help, but it doesn’t work.

I saw the post at https://devtalk.nvidia.com/default/topic/538148/cuda-programming-and-performance/overlapping-cpu-and-gpu-operations-using-streams-total-failure-any-help-/post/3775423/#3775423

Modifying the code to:

cudaStatus = cudaStreamQuery(streams[0]);
if (cudaStatus==cudaSuccess) {
    clock_block0<<<3,768,0,streams[0]>>>(100000);
    cudaStreamQuery(streams[0]);
    i++;
}
		
cudaStatus = cudaStreamQuery(streams[1]);
if (cudaStatus==cudaSuccess) {
    clock_block1<<<1,768,0,streams[1]>>>(200000);
    cudaStreamQuery(streams[1]);
    i++;
}

I get the following result:

On Windows with the WDDM driver model (Vista, 7, 8), work is submitted to the GPU in command buffers. Only work in the same command buffer can execute concurrently. cudaStreamQuery(stream) flushes the CUDA user-mode work queue, and each flush results in a separate WDDM kernel-mode driver (KMD) command buffer. Because your code calls cudaStreamQuery after every launch, each kernel ends up in its own command buffer, so the two kernels cannot overlap. You only want to flush the user-mode work queue when you want to do additional CPU processing. Calling a *Synchronize function or any synchronous command will also flush the user-mode work queue.

Your example should result in concurrent execution on Windows XP, Linux, and Mac OS X, and on Windows with the Tesla Compute Cluster (TCC) driver.
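
For illustration, here is a minimal sketch of the launch pattern this implies under WDDM: queue both kernels back to back with no query or synchronize call in between, then flush once. The clock_block kernel is a hypothetical stand-in for your clock_block0 and clock_block1, which are not shown in this thread. Whether the kernels actually overlap still depends on the driver and on the GPU having free resources, so treat this as a starting point and not a guaranteed fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for clock_block0/clock_block1 (assumption, not the
// original kernels): spin until roughly `cycles` GPU clock ticks have passed.
__global__ void clock_block(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    // Queue both launches first. No cudaStreamQuery or *Synchronize call
    // sits between them, so nothing forces a flush after the first launch
    // and both launches can end up in the same WDDM command buffer.
    clock_block<<<3, 768, 0, streams[0]>>>(100000);
    clock_block<<<1, 768, 0, streams[1]>>>(200000);

    // One flush after all launches, so the CPU can go on with other work
    // while the GPU starts executing. The return value (cudaSuccess or
    // cudaErrorNotReady) is deliberately ignored here.
    cudaStreamQuery(streams[0]);

    // Wait for both streams to finish and report any launch/runtime error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

To see whether the two kernels really overlap, look at the timeline in the profiler; the program itself only reports whether the launches succeeded.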