Is this strange behaviour with kernel concurrency expected?

Hi experts,

I am trying to get a deeper understanding of kernel concurrency, so I have created a very simple program to test it:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void K1()
{
	printf("Hello from K1!\n");
}

__global__ void K2()
{
    printf("Hello from K2!\n");
}

int main()
{
	cudaStream_t stream1, stream2;
	cudaStreamCreate(&stream1);
	cudaStreamCreate(&stream2);

	// 10 launches back-to-back in stream1
	for (int i = 0; i < 10; i++)
		K1<<<1, 1, 0, stream1>>>();

	// 10 launches back-to-back in stream2
	for (int i = 0; i < 10; i++)
		K2<<<1, 1, 0, stream2>>>();

	// 10 pairs of launches, alternating between the two streams
	for (int i = 0; i < 10; i++) {
		K1<<<1, 1, 0, stream1>>>();
		K2<<<1, 1, 0, stream2>>>();
	}

	// another 10 in stream2, then another 10 in stream1
	for (int i = 0; i < 10; i++)
		K2<<<1, 1, 0, stream2>>>();

	for (int i = 0; i < 10; i++)
		K1<<<1, 1, 0, stream1>>>();
	
	cudaStreamDestroy(stream1);
	cudaStreamDestroy(stream2);
	cudaDeviceReset();
	return 0;
}

The output from the profiler is shown in the attached image.

Now, my question is: is this really expected behavior? Isn't it odd that the execution of kernel K2 has to wait until almost all of the K1 launches have been drained from the submit queue? Shouldn't CUDA recognize that K2 is in a different stream and start executing it immediately?

Thanks,

Toni

I forgot to put a cudaDeviceSynchronize() before destroying the streams, but adding it really does not change the behavior at all…
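
For reference, this is where the synchronize call would go (a sketch of the end of main(); it just makes the host wait for all queued kernels before the streams are destroyed, and it does not change the ordering seen in the profiler):

	// ... kernel launches as above ...
	cudaDeviceSynchronize();   // block the host until all queued kernels have finished
	cudaStreamDestroy(stream1);
	cudaStreamDestroy(stream2);
	cudaDeviceReset();
	return 0;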

Have you tried doing something else in your kernels, other than printf?

cc?

you have both a concurrent-kernel limit and a block limit per cc

as a subsequent experiment, increase the block (or rather grid) dimensions and again note the output

what gpu are you compiling for and running on?
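
for reference, a minimal sketch (using cudaGetDeviceProperties) that prints the cc and whether the device reports concurrent-kernel support:

#include "cuda_runtime.h"
#include <stdio.h>

int main()
{
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("cc %d.%d, concurrentKernels = %d\n",
               prop.major, prop.minor, prop.concurrentKernels);
        return 0;
}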

I believe you are witnessing what it means not to have Hyper-Q, and the effect of depth-first launch vs. breadth-first launch.

Your initial launch sequence is a depth-first launch – all launches in one stream before moving to the next.

After that you have a breadth-first launch sequence – alternating streams.

There are artificial dependencies in the launch queue on pre-cc3.5 devices. Hyper-Q “fixed” this.
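
To illustrate the difference, here is a minimal sketch (using your K1/K2 and streams) of reordering the first two loops breadth-first – the issue order that avoids the false serialization on a pre-Hyper-Q device:

        // depth-first (as in the original code): all of stream1's work, then all of stream2's
        // for (int i = 0; i < 10; i++) K1<<<1, 1, 0, stream1>>>();
        // for (int i = 0; i < 10; i++) K2<<<1, 1, 0, stream2>>>();

        // breadth-first: alternate streams on every launch
        for (int i = 0; i < 10; i++) {
                K1<<<1, 1, 0, stream1>>>();
                K2<<<1, 1, 0, stream2>>>();
        }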

I modified your code slightly to get rid of the printf:

#include <stdio.h>
#define TDELAY 1000000U

// each kernel just spins for roughly TDELAY clock cycles, so it runs
// long enough for any overlap to be visible in the profiler
__global__ void K1()
{
        unsigned start = clock();
        while (clock() < start+TDELAY);
}

__global__ void K2()
{
        unsigned start = clock();
        while (clock() < start+TDELAY);
}

int main()
{
        cudaStream_t stream1, stream2;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);

        for (int i = 0; i < 10; i++)
                K1<<<1, 1, 0, stream1>>>();

        for (int i = 0; i < 10; i++)
                K2<<<1, 1, 0, stream2>>>();

        for (int i = 0; i < 10; i++) {
                K1<<<1, 1, 0, stream1>>>();
                K2<<<1, 1, 0, stream2>>>();
        }

        for (int i = 0; i < 10; i++)
                K2<<<1, 1, 0, stream2>>>();

        for (int i = 0; i < 10; i++)
                K1<<<1, 1, 0, stream1>>>();

        cudaStreamDestroy(stream1);
        cudaStreamDestroy(stream2);
        cudaDeviceReset();
        return 0;
}

When I run this on a Fermi device (cc2.0) I witness a pattern similar to what you have posted. When I compile and run on a cc3.5 device, I see this:

i.e. full overlap

Indeed, I also achieved complete concurrency when compiling and running on a device that supports cc5.2. My first experiment was on cc3.0 (Tesla K10). I would dare to say that this is a pretty serious limitation for devices < cc3.5.

Thank you all for your comments and suggestions…