Simultaneous execution of host and device launched kernels?

CUDA 7.0, Maxwell (Titan X)

Create 2 non-blocking queues (streams) HostQ1, HostQ2

Async launch kernel DyParK into queue HostQ1
Query HostQ1 to make sure DyParK starts
DyParK creates a non-blocking queue DeviceQ1 and periodically queues small kernels into it
After about 500 ms, all children of DyParK complete and DyParK exits
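
In outline, the device side looks something like this (the child kernel body, the launch count, and the launch configurations here are placeholders, not the actual code; dynamic parallelism needs nvcc -rdc=true -arch=sm_35 or higher and -lcudadevrt):

// dypark.cu -- sketch of the parent kernel and its device-side stream
#include <cuda_runtime.h>

__global__ void ChildKernel()                    // hypothetical small child kernel
{
    // trivial placeholder work
}

__global__ void DyParK()                         // launched as a single block
{
    if (threadIdx.x == 0) {
        // DeviceQ1: created on the device; device-side streams must be non-blocking
        cudaStream_t DeviceQ1;
        cudaStreamCreateWithFlags(&DeviceQ1, cudaStreamNonBlocking);

        for (int i = 0; i < 200; ++i)            // periodically queue small kernels
            ChildKernel<<<1, 32, 0, DeviceQ1>>>();

        cudaStreamDestroy(DeviceQ1);
    }
    // A parent kernel is not considered complete until all of its children
    // have completed, so DyParK exits only after the last ChildKernel returns.
}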

Meanwhile, the host code waits about 100 ms (while DyParK and its children are running) and then asynchronously queues a few hundred small kernels into HostQ2, with a queue query after each one to make sure they start.
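
And the host side, again in outline (SmallKernel, the launch count, and the sleep are placeholders; compile both files together with -rdc=true and -std=c++11):

// main.cu -- host side; link with dypark.cu above
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

__global__ void DyParK();                        // defined in dypark.cu above

__global__ void SmallKernel()                    // hypothetical host-launched kernel
{
    // trivial placeholder work
}

int main()
{
    cudaStream_t HostQ1, HostQ2;
    cudaStreamCreateWithFlags(&HostQ1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&HostQ2, cudaStreamNonBlocking);

    DyParK<<<1, 32, 0, HostQ1>>>();              // async launch into HostQ1
    cudaStreamQuery(HostQ1);                     // poke the stream so DyParK starts

    // wait ~100 ms while DyParK and its children are running
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    // queue a few hundred small kernels into HostQ2, querying after each one
    for (int i = 0; i < 300; ++i) {
        SmallKernel<<<1, 32, 0, HostQ2>>>();
        cudaStreamQuery(HostQ2);                 // nudge HostQ2 to flush
    }

    cudaStreamSynchronize(HostQ1);               // wait for DyParK to complete
    cudaStreamSynchronize(HostQ2);
    return 0;
}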

I hoped that there would be some overlap of the device-launched and host-launched kernels, but the visual profiler shows there is not. All of the host launches into HostQ2 are queued long before DyParK finishes, yet the first one does not commence execution until immediately after DyParK finishes.

DyParK is a single block, and according to the profiler there are large gaps of time where none of the device-launched kernels are running.

Why do none of the host-launched kernels execute until after the DyParK kernel finishes?

Scott

assuming you use pinned memory for at least d->h transfers

does the dypark kernel itself make allocations, or not?

“with a queue query after each one to make sure they start”

meaning what exactly?
how do you query?
and how does the host synchronize on the device's work?

are there any dependencies between “DyParK and children” and “a few hundred small kernels into HostQ2”?
in essence, can you change the order around?

There are no data transfers on or off the board; this is compute only.

No kernel allocates memory

I queue a kernel to streamN, then make a single call to cudaStreamQuery(streamN) to cause the stream to flush and execution of the kernel to commence.

After the kernels are queued to HostQ2, I do a cudaStreamSynchronize(HostQ1) to wait for the DyParK kernel to complete.

The fundamental question is: has anyone successfully launched host-dispatched kernels and had them execute while device-dispatched kernels are executing, or is the execution of host- and device-dispatched kernels mutually exclusive?

"The fundamental question is: has anyone successfully executed host-dispatched kernels and had them execute while device-dispatched kernels are executed or is the execution of host & device dispatched kernels mutually exclusive? "

I believe I have managed this before; it was some time ago, though. I have moved away from dynamic parallelism since, as I failed to see the value proposition.

Also, I do not recall anything in the programming guide stating the contrary.

Nevertheless, a "yes" to your fundamental question does not guarantee that you will actually see concurrent execution; there are normally numerous ways to render the possible impossible.

Perhaps post your profiler output and/or some code, or perhaps someone else knows better.

A similar question was asked here:

cuda - Overlap kernel execution on multiple streams - Stack Overflow

And the answer demonstrates that two parent kernels, along with the child kernels spawned from those two parent kernels, can all run concurrently. So I think the simple answer to your question is yes, it's theoretically possible.
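
A sketch along those lines (not the code from that answer, just the same general pattern, with placeholder kernels; build with -rdc=true -arch=sm_35 or higher): launch two parent kernels into separate non-blocking streams and let each spawn its children into its own device-created stream.

#include <cuda_runtime.h>

__global__ void Child()
{
    // trivial placeholder work
}

__global__ void Parent()
{
    if (threadIdx.x == 0) {
        // children go into a per-parent, device-created non-blocking stream,
        // so they do not serialize against the other parent's children
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        for (int i = 0; i < 8; ++i)
            Child<<<1, 32, 0, s>>>();
        cudaStreamDestroy(s);
    }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    Parent<<<1, 32, 0, s1>>>();   // both parents (and their children)
    Parent<<<1, 32, 0, s2>>>();   // can overlap if resources allow

    cudaDeviceSynchronize();
    return 0;
}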

Witnessing kernel concurrency requires a number of things to line up. You might want to read that answer to get the gist.

Windows and the WDDM driver model can also make it more difficult to witness concurrency.
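
A quick way to check which driver model a device is running under is the tccDriver field reported by cudaGetDeviceProperties (it will be 0 under WDDM, and also on non-Windows platforms):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %s driver model\n", prop.name,
           prop.tccDriver ? "TCC" : "WDDM (or non-Windows)");
    return 0;
}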