device streams

I’m trying simple examples with dynamic parallelism. When launching a child kernel, I first create a device stream and then launch into it, and in general I launch a lot of child kernels.

Here are my observations:

When I run my app without creating any stream, it works in every case :)
When I run my app with device streams, it doesn’t work. Moreover, I can’t see any error or message.

I’ve also tried calling cudaStreamDestroy after I launch the child, but nothing changed. Besides, my app works in every case if I launch a smaller number of child kernels.

Thank you very much in advance

My guess would be that you are not doing proper CUDA error checking in all of the following cases:

  1. Every CUDA API call in host code
  2. Every kernel launch in host code
  3. Every CUDA API call in device code
  4. Every kernel launch in device code
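
For reference, a minimal checking sketch might look like this (the macro name and message format are my own; the same pattern works in device code with CDP, since printf is available in kernels):

```cuda
#include <cstdio>

// Minimal error-check macro usable in both host and device code.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess)                                      \
            printf("CUDA error %s at %s:%d\n",                        \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
    } while (0)

// Host-side usage pattern:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous errors
```

The same two-step pattern (cudaGetLastError after the launch, then checking the result of a later synchronizing call) applies to kernels launched from device code.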

You might want to read the programming guide section on dynamic parallelism:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

It discusses device streams and mentions that streams are shared amongst all threads in a block. Given that, calling cudaStreamDestroy in device code would have to be done carefully.
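
As a sketch of the careful pattern I have in mind (my own code; child is assumed declared elsewhere): publish the stream through shared memory and destroy it only after all threads are done launching into it.

```cuda
__global__ void child(void);

__global__ void parent(void)
{
    // A device-created stream is shared by the whole block,
    // so keep the handle in shared memory.
    __shared__ cudaStream_t s;

    if (threadIdx.x == 0)
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    __syncthreads();                 // stream now visible to all threads

    child<<<1, 32, 0, s>>>();        // any thread may launch into it

    __syncthreads();                 // all launches issued before destroy
    if (threadIdx.x == 0)
        cudaStreamDestroy(s);        // already-launched work still runs
}
```

The second __syncthreads() is the important part: without it, one thread could destroy the stream while another thread in the block is still launching into it.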

Failure when launching too many child kernels is a typical symptom of exceeding the pending launch limit, which is also covered in the documentation.
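
For example, the pending launch count can be inspected and raised from host code before the parent kernel is launched (16384 here is just an illustrative value; check the guide for the default in your CUDA version):

```cuda
#include <cstdio>

int main(void)
{
    size_t n = 0;

    // Query the current pending-launch buffer size.
    cudaDeviceGetLimit(&n, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count: %zu\n", n);

    // Reserve room for more outstanding child launches.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
    return 0;
}
```

Raising the limit costs device memory, so the alternative is simply to launch fewer children at a time.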

Only threadIdx.x == 0 creates the device stream and destroys it after the child kernel launch. Could I still reach the limit in this case?

It doesn’t matter which threads are launching the kernels; it matters how many are launched. Try adding the error checking I described. It will then become obvious whether you are hitting the limit.

I still couldn’t solve my problem. I’m currently trying to do the error checking, but the result seems weird to me.

if (threadIdx.x == 0)
{
    cudaStream_t s;
    cudaError_t err = cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));

    child_cdp_new<<< buff_counter, ((int)get_optimal_thread(sh_total/buff_counter)), 0, s >>>(...);

    err = cudaGetLastError();
    if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));
}

But as far as I can see, the program doesn’t report an error when creating the stream. However, cudaGetLastError() returns either ‘no error’, which I thought was not cudaSuccess, or ‘invalid argument’.

How can I interpret all these things ?

Thanks in advance

Hi again,

I’ve found one more thing. As you can see in the code above, only threadIdx.x == 0 launches the child. If I create only one stream and bind all launches to it, my program works. However, if I create multiple streams as in the code above, it doesn’t work.

So, my questions here are:

  1. Can I create more than one stream in the same CUDA block?
  2. And does it even make sense to create more than one stream in the same CUDA block? Because as far as I know, there is only one child-kernel launch path in the SM.

Thanks in advance

The cudaGetErrorString function translates cudaSuccess to the string “no error”.
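
In other words, compare the returned cudaError_t against cudaSuccess rather than against the printed string. A minimal sketch:

```cuda
#include <cstdio>

void check_last_error(void)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)                       // compare the enum...
        printf("%s\n", cudaGetErrorString(err));  // ...not the string
    // Note: cudaGetErrorString(cudaSuccess) returns "no error", so if
    // you print unconditionally you will see "no error" on success.
}
```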

If none of the API calls are returning errors, then you may simply have a program logic error. Launching kernels in multiple streams means they may run asynchronously with respect to each other; if data flows from one kernel to the next, introducing streams can break the ordering your processing sequence depends on.
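
As a sketch of the distinction (producer/consumer are hypothetical names): child kernels that must run in order belong in the same device stream, while separate streams give no ordering guarantee.

```cuda
__global__ void producer(int *buf);
__global__ void consumer(int *buf);

__global__ void parent(int *buf)
{
    if (threadIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        producer<<<1, 32, 0, s>>>(buf);  // runs first
        consumer<<<1, 32, 0, s>>>(buf);  // ordered after producer

        // Launched into two different streams instead, consumer
        // could begin before producer finished writing buf.
        cudaStreamDestroy(s);
    }
}
```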

You can use multiple streams in CDP device code if you wish. The scope of a stream is (and should be) limited to the threadblock that created it:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams-and-events

Not sure what you mean by this:

“Because as I know, there is only one child kernel launch path in the SM.”

Thank you very much @txbob for your help and very useful answers. This forum is also quite useful for gaining insight into CUDA.

I wanted to say that (AFAIK) every SMX/SMM can launch only one kernel at a time, even if all threads in the same warp want to launch a kernel. Is this right? For example, the following code contains CDP, and the first half of the warp diverges from the second half. In this case, as I understand it, the first half of the warp executes the array store (arr[threadIdx.x] = threadIdx.x) at the same time. However, can they also launch the child at the same time?

__global__ void child(void);

__global__ void kernel(int *arr)
{
   if (threadIdx.x < 16)
   {
       arr[threadIdx.x] = threadIdx.x;
       child<<<32, 32>>>();
   }
   else
       arr[threadIdx.x] = threadIdx.x;
}

int main()
{
  //...

  kernel<<<1, 32>>>(arr);
}

I don’t think the warp kernel launch characteristics are published or specified. You should assume they may all launch more-or-less concurrently or in any order. That is pretty much the definition of an asynchronous kernel launch, and the asynchronous nature applies at this level as well.

So I would say YES, those 16 threads can launch a kernel “at the same time”. How those kernels actually execute is a separate question, and is not specified.

It’s not clear what your concern is.

Thank you very much @txbob for your answer. I’m learning how CDP works right now, so some of my words might be speculation.

Yes, you’re right. I haven’t seen any explanation of device-side kernel launch characteristics from NVIDIA either. However, all my speculation is based on this paper [1] :) In Section 2.4 they say that “there is a path from each SMX to the KMU so that all the SMXs are able to issue new kernel launching commands to the KMU”. Using that as a basis:

  • I thought that even if half of a warp executes the same device-kernel launch instruction, as in the previous code example, the launches would be serialized, since there is only one path from an SMX to the Kernel Management Unit (KMU).
  • I also thought that when I bind every child kernel to a different device stream, they may run with maximum concurrency. So in this case, kernel_1 in the following code should run faster, right?
__global__ void kernel_1()
{
  for (...)
  {
    if (threadIdx.x == 0)
    {
      cudaStream_t str;
      cudaStreamCreateWithFlags(&str, cudaStreamNonBlocking);
      child<<<32, 32, 0, str>>>();
    }
  }
}

__global__ void kernel_2()
{
  cudaStream_t str;
  if (threadIdx.x == 0)
    cudaStreamCreateWithFlags(&str, cudaStreamNonBlocking);

  for (...)
  {
    if (threadIdx.x == 0)
      child<<<32, 32, 0, str>>>();
  }
}

Thank you very much for your interest and answers in advance.

[1] http://casl.gatech.edu/wp-content/uploads/2015/04/dtbl_isca42_jin.pdf

If you know any more information about device-side kernel launches, I’d really appreciate it.
Thanks a lot.