Dynamic Parallelism, HyperQ and cudaDeviceSynchronize()

I can’t find any concrete information on this…

I have 7 CPU threads, each launching a “series-of-kernels” into its own stream so that the 7 streams execute concurrently via HyperQ (7 streams active at any one time). Each stream launches a single parent kernel that loops, launching a subset of child kernels using dynamic parallelism.
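
For reference, the host side looks roughly like this (a sketch; the thread/stream plumbing and launch configuration are illustrative, and Parent is the parent kernel from the pseudocode below):

#include <thread>
#include <vector>
#include <cuda_runtime.h>

void LaunchAllSeries()   // illustrative name
{
    const int NumStreams = 7;
    std::vector<std::thread> workers;

    for( int s = 0; s < NumStreams; s++ )
    {
        workers.emplace_back( []()
        {
            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // Each CPU thread launches its own parent kernel into its own stream;
            // HyperQ lets the 7 streams execute concurrently on the device.
            Parent<<<1, 256, 0, stream>>>();   // launch configuration is illustrative

            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        });
    }

    for( auto& t : workers ) t.join();
}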

Pseudocode:

__global__ void Parent()
{
    bool StillActive = true;

    for( int i = 0; StillActive && (i < MaxNumIterations); i++ )
    {
         InitializationKernel<<<InitGrid, InitBlock>>>();  // Sets globalMemoryResult's value

         // Intended: wait for InitializationKernel to finish before reading its result.
         // In reality __syncthreads() only synchronizes the threads of this block;
         // it does NOT wait for the child kernel to complete.
         __syncthreads();
         StillActive = globalMemoryResult;
         __syncthreads();

         if( StillActive )
         {
               Work1Kernel<<<Work1Grid, Work1Block>>>();
               Work2Kernel<<<Work2Grid, Work2Block>>>();
               ....
               WorkNKernel<<<WorkNGrid, WorkNBlock>>>();
         }
    }

    __syncthreads();  // again: does not wait for the Work kernels to complete

    FinalizeKernel<<<FinalizeGrid, FinalizeBlock>>>();
}

There are some inherent problems with this…

  1. Most of the work depends on whether StillActive is true or not. This means there must be synchronization after InitializationKernel: I need to know that InitializationKernel has completed and that a subsequent read of globalMemoryResult will return correct data.
  2. The loop counter is dependent on the results of the loop contents; its intent is to break out of the loop when there is no more work to do. Unfortunately, this condition is entirely data dependent, so it is impossible to predict whether the loop will exit after one iteration or run to MaxNumIterations. Again, the loop itself and FinalizeKernel are entirely dependent on the child kernels completing and having deterministically set the appropriate state.

Using __syncthreads is insufficient because it only synchronizes the threads within the current block, and child kernel launches don’t block parent execution, so the parent will happily synchronize its own threads while the child kernels are not yet complete. That means there is a race on the value of globalMemoryResult.

I used cudaDeviceSynchronize, which works great functionally: all child kernels complete, the results are available, and everything continues on happily. The problem is that this call seems to cost up to 100us, which is huge given that the child kernels themselves only add up to 200us-300us, so as much as 30% of the total cost of this code is spent executing cudaDeviceSynchronize.
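
Concretely, the working-but-slow variant looks roughly like this (ParentWithSync is just an illustrative name for the sketch):

__global__ void ParentWithSync()
{
    bool StillActive = true;

    for( int i = 0; StillActive && (i < MaxNumIterations); i++ )
    {
         InitializationKernel<<<InitGrid, InitBlock>>>();  // Sets globalMemoryResult's value
         cudaDeviceSynchronize();      // waits for the child kernel; the expensive call (~100us)
         StillActive = globalMemoryResult;

         if( StillActive )
         {
               Work1Kernel<<<Work1Grid, Work1Block>>>();
               ....
               WorkNKernel<<<WorkNGrid, WorkNBlock>>>();
               cudaDeviceSynchronize(); // the Work kernels must finish before the next iteration
         }
    }

    FinalizeKernel<<<FinalizeGrid, FinalizeBlock>>>();
}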

I also can’t find any concrete documentation of cudaDeviceSynchronize as it pertains to its use with HyperQ and Dynamic Parallelism. The very brief description in the API documentation says “Blocks until the device has completed all preceding requested tasks…”. That isn’t what I need: I need the current (parent) kernel’s stream to block until all of its child kernels have completed. I DON’T want to block the entire device, since other streams are potentially executing other kernels separately and concurrently (via HyperQ).

(http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g10e20b05a95f638a4071a655503df25d)

I would settle for having the current stream block until all of its preceding requested tasks have completed, basically a cudaStreamSynchronize. But I don’t know if it is possible for a kernel to know its own stream so that it can block on any remaining work from its child kernels.

I then attempted to create child streams and do a cudaStreamWaitEvent. This doesn’t seem to work either.
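
For reference, the pattern I tried inside the parent looked roughly like this (childStream and initDone are illustrative names); as far as I can tell, cudaStreamWaitEvent only orders later work submitted to the stream behind the event, it does not block the launching thread itself:

// Sketch of the child-stream / event attempt, inside the parent's loop body.
cudaStream_t childStream;
cudaEvent_t  initDone;
cudaStreamCreateWithFlags(&childStream, cudaStreamNonBlocking); // device streams must be non-blocking
cudaEventCreateWithFlags(&initDone, cudaEventDisableTiming);    // device events must disable timing

InitializationKernel<<<InitGrid, InitBlock, 0, childStream>>>();
cudaEventRecord(initDone, childStream);
cudaStreamWaitEvent(childStream, initDone, 0);  // orders childStream, not this thread

StillActive = globalMemoryResult;               // still races with InitializationKernel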

In summary:
I basically need suggestions on how to block a DP (parent) stream from continuing until its child kernels have completed WITHOUT using cudaDeviceSynchronize.

Is there a way to get a kernel’s “this-stream”? i.e. from within a kernel, is there a way to get the stream it is executing in, so it can do a cudaStreamSynchronize on ONLY the work that it and its children are doing?

Should using a child-stream and event synchronization work?

This is a moderately old thread, but I’ve encountered a roughly similar issue.

Just as you did, I can’t seem to get cudaStreamWaitEvent to actually function properly on device-launched kernels. Below is example pseudo-code showing what I mean, using something of a producer-consumer model: one warp ‘produces’ a cudaEventRecord and another warp ‘consumes’ that event with cudaStreamWaitEvent.

__global__ void kernel(...){
  // The stream handles, the event, and the flag are shared so the producer warp
  // and the consumer warp see the same objects.
  __shared__ cudaStream_t streamAlpha, streamBeta;
  __shared__ cudaEvent_t event;
  __shared__ volatile unsigned int flag; // volatile to make sure the write is visible after cudaEventRecord

  if(threadIdx.x == 0){
    cudaStreamCreateWithFlags(&streamAlpha,cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&streamBeta,cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    flag = 0;
  }
  __syncthreads();

  int warpIdx = threadIdx.x / 32;

  if( warpIdx == 0){ // one particular warp acts as the 'producer'
    if(threadIdx.x == 0){
      kernelAlpha<<<....,streamAlpha>>>(...);
      cudaEventRecord(event,streamAlpha);
      flag = 1; // signal the consumer warp
    }
  }else if(warpIdx == 1){ // a different warp acts as the 'consumer'
    if(threadIdx.x == 32){ // first lane of the consumer warp
      while( flag != 1 ); // wait for the flag to be updated
      cudaStreamWaitEvent( streamBeta, event, 0); // make streamBeta wait on kernelAlpha
      kernelBeta<<<...,streamBeta>>>(...); // kernelBeta shouldn't execute until kernelAlpha is done
    }
  }
  ...
}

The problem I’ve found is that sometimes kernelBeta executes simultaneously with kernelAlpha, despite the attempt at using cudaStreamWaitEvent. This can be seen, for example, in the NVIDIA Visual Profiler: the two kernels (if they’re sufficiently small) overlap in execution. Though, admittedly, the way the producer-consumer model works is very touchy about write-before-read issues surrounding the ‘flag’ variable vs the cudaEventRecord, hence throwing in the ‘volatile’ keyword.

One way I’ve worked around this is by making my own equivalent of cudaEventQuery:

__global__ void kernelAlpha(..., unsigned int* alphaBlockCount){
  ...
  ...
  if(threadIdx.x == 0)
    atomicInc( alphaBlockCount, gridDim.x ); // each block bumps the completion count once
}

__global__ void masterKernel(...){
  // The counter pointer and the stream handles are shared so both warps see the same values.
  __shared__ unsigned int* alphaBlockCount;
  __shared__ cudaStream_t streamAlpha, streamBeta;

  if(threadIdx.x==0){
    alphaBlockCount = (unsigned int*)malloc(sizeof(unsigned int)); // allocate a global counter
    memset(alphaBlockCount,0,sizeof(unsigned int));
    cudaStreamCreateWithFlags(&streamAlpha,cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&streamBeta,cudaStreamNonBlocking);
  }
  __syncthreads();

  int warpIdx = threadIdx.x / 32;

  unsigned int numBlocks = ...; // number of blocks to use for kernelAlpha
  if( warpIdx == 0){ // one particular warp launches kernelAlpha
    if(threadIdx.x == 0){
      kernelAlpha<<<numBlocks,...,streamAlpha>>>(...,alphaBlockCount);
    }
  }else if(warpIdx == 1){ // a different warp waits, then launches kernelBeta
    if(threadIdx.x == 32){ // first lane of the second warp
      // spin until every block of kernelAlpha has incremented the counter;
      // the volatile cast keeps the read from being cached in a register
      while( *(volatile unsigned int*)alphaBlockCount != numBlocks );
      kernelBeta<<<...,streamBeta>>>(...);
    }
  }
  ...
  cudaDeviceSynchronize();
  __syncthreads();
  if(threadIdx.x==0){
    free(alphaBlockCount);
  }
}

This seems a bit silly, but hey… it seems to work. I’ve kinda wondered if I technically need a __threadfence() in the first kernel, but I haven’t been able to show any correctness issues (probably because, in practice, kernelBeta launches long enough after kernelAlpha completes that any lingering memory-visibility issues don’t show up).
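
If it does turn out to be needed, my guess is it would go at the end of kernelAlpha, right before the counter is bumped, something like this (a sketch):

  // hypothetical ending for kernelAlpha
  __syncthreads();     // make sure the whole block has finished its writes
  __threadfence();     // make those writes visible device-wide before the counter update
  if(threadIdx.x == 0)
    atomicInc( alphaBlockCount, gridDim.x );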