Very slow kernel launches

I’m profiling a slow application, and I’m seeing that every kernel launch’s cudaLaunch call is taking around 150-200uS. This makes the smaller kernel launches much slower than they should be.

Launching kernels is relatively expensive, but this is an order of magnitude slower than it should be. The numbers I’ve heard are on the order of 5uS plus about 0.5uS per texture/surface. I have about 30 textures, which would account for about 15uS, still nowhere near the 150uS I’m seeing. (That’s not counting texture objects. There are around 20-30 of those, but they’re not used by the small kernels that are being called a lot.)

I’m not sure what to look for. What else might cause kernel launches to be this slow? I’m on Windows 7, with a GeForce 750 Ti on the 347.88 drivers. The main loop runs around 8 kernels, and all but one of them are very simple. This is with CUDA 7, but I saw it on 6.5 too. I’m compiling for SM 3.0. The timing numbers I’m getting are from the CUDA profiler in MSVC.

Wow. It looks like kernels actually take closer to 5uS extra per texture reference bound. That’s wildly different from what http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/ says (“up to 0.5 μs per texture reference”). Looks like I’ll need to spend the time to convert a bunch of texture references to texture objects.

But, is this really what’s supposed to happen? 5uS per texture reference is insanely expensive.
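
For reference, the conversion in question looks roughly like this for a linear float buffer (just a sketch; devPtr, numElems, and myKernel are placeholders for my actual buffers and kernels):

cudaResourceDesc resDesc = {};
resDesc.resType                = cudaResourceTypeLinear;
resDesc.res.linear.devPtr      = devPtr;
resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = numElems * sizeof(float);

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

// The object is passed to the kernel as an ordinary argument, so there is no
// per-launch binding cost: myKernel<<<grid, block>>>(texObj, ...);
// inside the kernel, read with tex1Dfetch<float>(texObj, i).

cudaDestroyTextureObject(texObj);   // when the underlying buffer goes away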

How are you measuring the time for the kernel launch? You would want to use a framework similar to the following:

cudaDeviceSynchronize();          // make sure all previous GPU activity has completed
start = high_resolution_timer();
kernel<<<grid, block>>>();        // launch is asynchronous with respect to the host
cudaDeviceSynchronize();          // wait until the kernel has finished
stop = high_resolution_timer();
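
As a self-contained sketch of the above, assuming a trivial empty kernel and using std::chrono as the host-side high-resolution timer:

#include <cstdio>
#include <chrono>

__global__ void nullKernel() {}

int main()
{
    nullKernel<<<1, 1>>>();       // warm-up launch: absorbs context creation / JIT cost
    cudaDeviceSynchronize();      // make sure all previous GPU activity has completed

    auto start = std::chrono::high_resolution_clock::now();
    nullKernel<<<1, 1>>>();       // asynchronous launch
    cudaDeviceSynchronize();      // wait until the kernel has finished
    auto stop = std::chrono::high_resolution_clock::now();

    printf("launch + sync: %.1f usec\n",
           std::chrono::duration<double, std::micro>(stop - start).count());
    return 0;
}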

The amount of overhead may also be distorted on Windows systems with the WDDM driver, due to the high inherent overhead of that driver model and the batching that the CUDA driver applies to reduce the average launch overhead.

I’m using the Nsight profiler via MSVC.

Timing after a synchronize would wait for the actual kernel to complete. It’s not the kernel execution that stands out in the profile; it’s the async cudaLaunch calls themselves.

You can leave out the cudaDeviceSynchronize() after the kernel if you want. On a Linux system, with a modern CUDA version, using a null kernel, you would find that each launch takes about 5 usec, and 20 usec with the cudaDeviceSynchronize() added back in. If you then make the kernel launch more complex by adding bound textures, and time the increment due to each additional texture, I think you will find pretty much the timing stated in the blog article.

If you now repeat the experiment on Windows with the WDDM driver model (which is what you are stuck with when using a consumer GPU), you will see something like this: already with a null kernel, timing is all over the place, from 10 usec to 80 usec for the launch only, that is, without a call to cudaDeviceSynchronize(). Average across many launches is maybe 20 usec. Add back the call to cudaDeviceSynchronize() (but still using a null kernel) and some kernel executions will clock in at significantly above 100 usec. You are observing the effects of (1) much higher overall overhead in the WDDM driver model, and (2) launch batching in the CUDA driver, which makes the execution times uneven but reduces the overall and average WDDM overhead.
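
A rough sketch of how the launch-only overhead can be averaged across many launches (reusing the nullKernel and timer from the earlier snippet; for a null kernel the loop time is dominated by launch overhead):

const int N = 1000;
cudaDeviceSynchronize();
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < N; ++i)
    nullKernel<<<1, 1>>>();       // launch only, no synchronize inside the loop
auto stop = std::chrono::high_resolution_clock::now();
cudaDeviceSynchronize();          // drain the queued work afterwards
double usecPerLaunch =
    std::chrono::duration<double, std::micro>(stop - start).count() / N;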

I have not performed a test under the WDDM driver model with many bound textures, but it stands to reason that the overall overhead per texture added will remain significantly higher than what is seen with the Linux driver, or the TCC driver on Windows for that matter (which is a non-graphics driver that can operate without much OS interference).

In a nutshell, if driver overhead is important for your use case, you would want to use either Linux or a professional GPU with TCC driver under Windows. Don’t know about Mac OS X.

Note that the performance of the host system has some influence on the launch overhead. Usually minor in my experience. The above are timings I observed on modern fast Ivybridge and Haswell based desktop machines with PCIe3 interface. You may see higher times on a slower laptop operating in battery mode.

BTW, the GTX 750 Ti is an sm_50 device, so you probably would want to build native code for that architecture rather than sm_30 to avoid any possible overhead caused by JIT compilation (although in my understanding that overhead should be incurred at CUDA context initialization time).
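
For example, something along these lines (assuming a single-file build; app.cu is a placeholder):

nvcc -arch=sm_50 -o app app.cu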

I think njuffa has by now comprehensively addressed the issue.

zewt mentions using the profiler for timing purposes.

I now wonder whether this is indeed possible (accurate), given that the profiler and debugger normally set CUDA_LAUNCH_BLOCKING, forcing kernel calls to be synchronous.

The kernels aren’t running synchronously, but they’re indeed behaving oddly: a bunch of cudaLaunch calls are made, but the kernels don’t start until the CPU is blocking on a GPU readback, at which point all of the kernels kick off (after another 150uS delay). I’ll need to check the timing manually to see how much the profiler is skewing things.

The short answer is that the blog post isn’t correct in practice: the overhead per texture isn’t 0.5uS on a typical user’s system. The solutions are the same (switch to texture objects, and try to find ways to reduce the number of tiny kernel launches), but I wanted to make sure there wasn’t something unusual going on before spending the time doing that.

On WDDM, the CUDA driver does not submit the work to the GPU until the queued commands exceed a set size in bytes, or until you request the work to complete. If you call cudaEventQuery(0) you can cause an early submission of the buffer. For additional information see https://devtalk.nvidia.com/default/topic/548639/is-wddm-causing-this-/?offset=1.

nvprof, CUPTI, the Visual Profiler, and the Nsight VSE CUDA profiler do not set CUDA_LAUNCH_BLOCKING. When performing kernel profiling, the profilers do call the equivalent of cudaDeviceSynchronize() between launches, as they have to collect the PM counters. However, when you collect trace information the profilers are minimally invasive. The timing measurement runs from as close as possible to the kernel’s first instruction to its completion. This is different from using cudaEvents (CUevent). cudaEvents will show a greater duration, as they also include any commands deferred by the driver, can contain time for context switching on the CPU and GPU, and will include launch setup (arguments, bindings, …).
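
For reference, the cudaEvent-based timing being contrasted here looks roughly like this (a sketch; the extra time mentioned above is folded into the measured interval):

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, 0);      // record on the default stream
kernel<<<grid, block>>>();
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);     // wait for the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, evStart, evStop);

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);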

Greg, thanks for the pointer. That actually doesn’t seem to work in CUDA 7 (it had no effect), but I got it to work by creating an actual event and querying that. The dispatch cost is still a big factor, but this lets me mask a bit of it.
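
For anyone hitting the same thing, roughly what I ended up with (a sketch; flushEvent is created once at startup):

cudaEvent_t flushEvent;
cudaEventCreateWithFlags(&flushEvent, cudaEventDisableTiming);

// ... queue a batch of kernel launches on the default stream ...

cudaEventRecord(flushEvent, 0);
cudaEventQuery(flushEvent);       // returns cudaErrorNotReady while work is pending;
                                  // the query itself nudges the WDDM buffer submission

// ... cudaEventDestroy(flushEvent) at shutdown ...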