I can’t find any concrete information on this…
I have 7 CPU threads, each launching a “series-of-kernels” to be executed concurrently (via HyperQ) in its own stream, so 7 streams are active at any one time. Each stream launches a single parent kernel that loops, launching a subset of child kernels using dynamic parallelism.
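The host side of this setup looks roughly like the following (a sketch only; the worker count, launch configuration of Parent, and use of non-blocking streams are assumptions based on the description above):

```cuda
// Hypothetical host-side sketch: one non-blocking stream per CPU thread,
// each running its own parent kernel concurrently via HyperQ.
constexpr int NumWorkers = 7;

void WorkerThread(int id)
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    Parent<<<1, 1, 0, s>>>();   // assumed <<<1,1>>>: one controlling thread

    cudaStreamSynchronize(s);   // host blocks only on its own stream,
                                // not on the other 6 workers' streams
    cudaStreamDestroy(s);
}
```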
Pseudocode:
__global__ void Parent()
{
    bool StillActive = true; // declared before the loop so the loop condition can read it
    for( int i = 0; StillActive && (i < MaxNumIterations); i++ )
    {
        InitializationKernel<<<InitGrid, InitBlock>>>(); // Sets globalMemoryResult's value
        __syncthreads();
        StillActive = globalMemoryResult;
        __syncthreads();
        if( StillActive )
        {
            Work1Kernel<<<Work1Grid, Work1Block>>>();
            Work2Kernel<<<Work2Grid, Work2Block>>>();
            ....
            WorkNKernel<<<WorkNGrid, WorkNBlock>>>();
        }
    }
    __syncthreads();
    FinalizeKernel<<<FinalizeGrid, FinalizeBlock>>>();
}
There are some inherent problems with this…
- Most of the work depends on whether StillActive is true. This means there needs to be synchronization after InitializationKernel: I need to know that InitializationKernel is complete so that a subsequent read from globalMemoryResult returns correct data.
- The loop condition depends on the results of the loop contents. Its intent is to break out of the loop when there is no more work to do. Unfortunately, this condition is entirely data dependent, so it is impossible to predict whether the loop will exit after one iteration or run to MaxNumIterations. Again, both the loop itself and FinalizeKernel are entirely dependent on the child kernels completing and having set the appropriate state deterministically.
Using __syncthreads is insufficient because it only synchronizes the threads of the current block, and child kernel launches are asynchronous and do not block parent execution, so the parent will happily synchronize its own threads while the child kernels are not yet complete. This leaves a race on the value of globalMemoryResult.
I used cudaDeviceSynchronize, which works great functionally: all child kernels complete, the results are available, and everything continues on happily. The problem is that this call seems to cost up to 100us, which is huge because the child kernels themselves only add up to 200us-300us, so as much as 30% of the total cost of this code is spent executing cudaDeviceSynchronize.
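For reference, the functionally-correct-but-slow version looks roughly like this (a sketch assuming Parent runs with a single controlling thread; kernel names, grids, and globalMemoryResult are placeholders from the pseudocode above):

```cuda
// Sketch of the working-but-slow variant, compiled with -rdc=true
// (dynamic parallelism). Assumes Parent is launched <<<1,1>>>.
__global__ void Parent()
{
    bool StillActive = true;
    for( int i = 0; StillActive && (i < MaxNumIterations); i++ )
    {
        InitializationKernel<<<InitGrid, InitBlock>>>();
        cudaDeviceSynchronize();          // device-side sync: waits for the
                                          // child grid (~100us observed)
        StillActive = globalMemoryResult; // safe to read after the sync
        if( StillActive )
        {
            Work1Kernel<<<Work1Grid, Work1Block>>>();
            WorkNKernel<<<WorkNGrid, WorkNBlock>>>();
            cudaDeviceSynchronize();      // wait for the work kernels before
                                          // the next iteration's init
        }
    }
    FinalizeKernel<<<FinalizeGrid, FinalizeBlock>>>();
}
```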
I also can’t find any concrete documentation of cudaDeviceSynchronize as it pertains to its use with HyperQ and Dynamic Parallelism. The very simple description shown in the API documentation says “Blocks until the device has completed all preceding requested tasks…”. This isn’t what is needed, I need the current (parent) kernel’s stream to block until all child kernels have completed. I DON’T want to block the entire device since there are other streams potentially executing other kernels separately and concurrently (via HyperQ).
I would settle for the current stream to block until all preceding requested tasks have completed. Basically cudaStreamSynchronize. I don’t know if it is possible for a kernel to know its own stream so it can block on any remaining work for child kernels.
I then attempted to create child streams and use cudaStreamWaitEvent. This doesn’t seem to work either.
In summary:
I basically need suggestions on how to block a DP stream from continuing until the child kernels have completed WITHOUT using cudaDeviceSynchronize.
Is there a way to get a kernel’s “this-stream”? i.e., from within a kernel, is there a way to get the stream it is running in, so it can do a cudaStreamSynchronize on the work that it and its children are doing ONLY?
Should using a child-stream and event synchronization work?
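For clarity, here is roughly what I tried (a sketch; the stream/event names are placeholders, and only the device-runtime calls I believe exist in device code are used: cudaStreamCreateWithFlags with cudaStreamNonBlocking, cudaEventCreateWithFlags with cudaEventDisableTiming, cudaEventRecord, cudaStreamWaitEvent). My suspicion is that cudaStreamWaitEvent only orders later launches in another stream; it never blocks the parent kernel’s own threads, so the parent still cannot safely read globalMemoryResult:

```cuda
// Sketch of the child-stream + event attempt (device code).
__global__ void Parent()
{
    cudaStream_t initStream, workStream;
    cudaEvent_t  initDone;
    cudaStreamCreateWithFlags(&initStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&workStream, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&initDone, cudaEventDisableTiming);

    InitializationKernel<<<InitGrid, InitBlock, 0, initStream>>>();
    cudaEventRecord(initDone, initStream);

    // Orders later launches in workStream behind InitializationKernel...
    cudaStreamWaitEvent(workStream, initDone, 0);
    Work1Kernel<<<Work1Grid, Work1Block, 0, workStream>>>();

    // ...but the parent thread itself never blocked, so this read still
    // races with InitializationKernel:
    bool StillActive = globalMemoryResult;

    cudaEventDestroy(initDone);
    cudaStreamDestroy(initStream);
    cudaStreamDestroy(workStream);
}
```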