How to effectively parallelize CUDA kernel launches on the CPU

My application uses 4 streams, and I want them to run concurrently. The code is something like this:

for (int i=0; i<N; i++)   //batch numbers
{
    for (int j=0; j<4; j++)
       myCudaCode(stream[j]);    // working code using the specified stream
}

However, from the nvvp profiler I see that the streams are actually not running concurrently, because the CPU is fully occupied by the kernel launches. I did not use any cudaDeviceSynchronize. You can see the figure at the following link.

I understand that all my kernels on the GPU are pretty small, with run times comparable to the kernel launch times on the CPU, but so far we do not intend to change them. From the above figure, I see most kernel launches on the CPU take around 5~10 us, which seems normal. The whole processing time for one batch is around 0.4 ms (shown in gray).

An intuitive way to optimize the code is to use multi-threading to parallelize the CUDA kernel launches on the CPU. Here is what I did using OpenMP:

for (int i=0; i<N; i++)
{
    #pragma omp parallel num_threads(4)
       myCudaCode(stream[omp_get_thread_num()]);
}

Now the nvvp profiler output looks like this:

The four streams now appear to run concurrently. However, for each CPU thread, the CUDA kernel launches are no longer as compact as before and are significantly stretched (typically 20~30 us). The resulting time required to process one batch (shown in gray) is now around 0.5 ms, even longer than in the single-thread case.

I also tried the pthread approach. It shows a similar problem.

So I'd like to ask for an effective way to parallelize the kernel launches on the CPU. Ideally, I would expect the launch time to be reduced to roughly one fourth.

GPU kernels can only run concurrently when each individual kernel does not fill the GPU completely. Are your GPU kernels sufficiently small that they do not fill up the GPU individually?
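As a rough sanity check (just a sketch, not taken from your code; myKernel and the block size are placeholders), you can compare the grid size you launch with the number of blocks the device can keep resident at once:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *d, int n) { /* placeholder kernel body */ }

// Compare the launched grid size against the number of blocks the device can
// hold concurrently; if the grid is much smaller, the kernel leaves room for
// kernels from other streams to run alongside it.
void reportOccupancy(int gridSize, int blockSize)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);

    printf("launching %d blocks; device can hold ~%d resident blocks\n",
           gridSize, blocksPerSM * prop.multiProcessorCount);
}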

What OS are you on? If it is Windows and you are using the default WDDM driver, kernel launch batching performed by the driver for performance reasons may prevent the kernels from running concurrently even if they are small enough.

If it were possible to concurrently run kernels that are able to fill the GPU by themselves, the overall throughput of the processing pipeline wouldn’t improve.

Your initial approach of using non-null streams to achieve concurrency is appropriate assuming the kernels are small enough to facilitate that. Starting the kernels from different CPU threads is not helpful as all you are doing is adding host-side overhead.
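A minimal, self-contained sketch of that single-thread, multi-stream pattern (the kernel, sizes, and number of streams here are illustrative, not taken from your code) might look like this:

#include <cuda_runtime.h>

// Tiny kernel: only a few blocks, so it leaves most of the GPU free for
// kernels launched into other streams.
__global__ void smallKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 2.0f + 1.0f;
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 12;                 // small per-kernel workload
    cudaStream_t stream[nStreams];
    float *d_buf[nStreams];

    for (int j = 0; j < nStreams; j++) {
        cudaStreamCreate(&stream[j]);      // non-null streams enable overlap
        cudaMalloc(&d_buf[j], n * sizeof(float));
    }

    // A single host thread issues all the launches; as long as each kernel is
    // small, kernels in different streams are eligible to run concurrently.
    for (int i = 0; i < 10; i++)           // batches
        for (int j = 0; j < nStreams; j++)
            smallKernel<<<(n + 255) / 256, 256, 0, stream[j]>>>(d_buf[j], n);

    cudaDeviceSynchronize();

    for (int j = 0; j < nStreams; j++) {
        cudaFree(d_buf[j]);
        cudaStreamDestroy(stream[j]);
    }
    return 0;
}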

I'm pretty sure every kernel is small enough, far from using the GPU's full compute resources.

I'm using Linux, with an 8-core i7 CPU and a GTX 1070 GPU.

I don't understand why the host-side overhead of starting the kernels from different CPU threads would be that large. Though I did not give a zoomed-in figure, there are actually 14 kernels launched in each stream. From the nvvp profiler, not only the first kernel launch of each thread but all the kernel launches incur significant latency. Also, every kernel launch time increases from 5~10 us to 20~30 us, which does not seem normal to me.

So if we do not intend to change the kernel functions, do you mean there is no way to effectively parallelize the kernel launches to improve performance? In the future we may want to increase to 16 streams or even more, and it is not ideal that all the streams have to be serialized.

I don’t know what your kernels look like or how they are configured, so I can’t give feedback on whether they should be able to execute concurrently. CUDA ships with a sample application that demonstrates concurrent kernels (named concurrentKernels), which can also be downloaded individually:

CUDA Samples :: CUDA Toolkit Documentation

Have you had a chance to experiment with that app? I have not tried the profiler with this app, so I don’t know what to expect. The profiler may be intrusive enough to make an observation of concurrently executing kernels unlikely. IMHO, the easiest way forward would be to give that app a try and then use it as a template for your own work.
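For context, the core idea of that sample is a kernel that simply spins for a fixed number of GPU clock cycles, so each kernel uses almost no resources but runs long enough for overlap between streams to be visible. A rough sketch of that idea (not the sample's actual code) is:

// Rough sketch of the idea behind the concurrentKernels sample (not its actual
// code): each kernel busy-waits for a fixed number of GPU clock cycles, so it
// is tiny in resource terms but long enough in time for overlap to show up in
// the profiler timeline.
__global__ void clockSpinKernel(long long spinCycles)
{
    long long start = clock64();
    while (clock64() - start < spinCycles)
        ;   // spin
}

// Launched with one block per stream, e.g.
//   clockSpinKernel<<<1, 1, 0, stream[j]>>>(spinCycles);
// such kernels should be able to execute concurrently across streams.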

I noticed that there is a sequence of ~10 kernels running in each stream in the single-thread case.

That doesn’t seem to match the dispatch you’ve indicated:

for (int i=0; i<N; i++)   //batch numbers
{
    for (int j=0; j<4; j++)
       myCudaCode(stream[j]);    // working code using the specified stream
}

I can't really match the two. There doesn't seem to be enough information here, or else what you've shown code-wise is not really representative of your actual code in the single-thread case.

Sorry if there is any confusion. For

myCudaCode(stream[j]);

it is not just a kernel function. It is a host function that actually contains quite a lot: ~10 kernel calls and some host-side code. All the kernel calls are issued on the specific stream with index j.
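Roughly, the structure is like this (a simplified sketch; the kernel names, arguments, and signature are placeholders, not my real code):

#include <cuda_runtime.h>

// Hypothetical stand-ins for the real kernels (placeholders only)
__global__ void kernelA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void kernelB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Sketch of the structure of myCudaCode: ~10 small kernels (two shown here)
// plus some host-side work, all issued on the one stream passed in.
void myCudaCode(cudaStream_t stream, float *d_data, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    kernelA<<<grid, block, 0, stream>>>(d_data, n);
    kernelB<<<grid, block, 0, stream>>>(d_data, n);
    // ... more kernels on the same stream, some host-side code,
    //     and no cudaDeviceSynchronize in between ...
}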

Hope it can clarify.

I looked at the sample code. It involves unified memory, and therefore the nvvp profiler output looks weird and is not easy to interpret; sorry, I do not have much knowledge of UM. But I noticed that its kernel launches on the CPU also typically take around 30 us or more.

At first glance, that sample also uses a simple OpenMP structure, which does not look much different from mine.

Update: based on my experiments, it seems that using multiple CPU threads does not reduce the total kernel launch time at all. Suppose the single-thread code for processing N streams requires time T; then the OpenMP N-thread code for processing N streams (one stream per thread) also requires approximately time T. As the figure I posted shows, even though the streams now appear concurrent, every kernel launch latency also becomes significant, and interestingly the total time (for one batch of N streams) therefore remains approximately unchanged.

I suspect that OpenMP+CUDA may only help for multi-GPU computing; for single-GPU computing it is not working as we expect. I hope the NVIDIA people can check their internal logs to see if this problem has been reported before.

It's not a problem; it's how things should work. You have only one GPU (or N GPUs, it doesn't matter). Why would filling it (them) with work from multiple CPU threads, or even multiple CPUs, change the time that the GPU(s) need to execute the kernels? The amount of work is the same; it is independent of how many CPUs are sending that work to the GPU(s).

You may not have looked at the profiler image I posted. When I used only one CPU thread, the four streams did not overlap, which means we were not fully using the GPU's compute capability. Since the computational task in each stream is pretty small, we really need the streams to overlap in order to save time.