Kernel function calls with regard to cudaDeviceSynchronize()

I read in an article that multiple kernel calls issued this way were not executed. I'm not sure how old the SDK they used was, but I want to answer this question:

What happens if I do this:

MagicalUnicorns<<<3, 5>>>(x, y, somegpuptr);
MagicalUnicorns<<<3, 5>>>(x, y, somegpuptr);
cudaDeviceSynchronize();

vs this:

MagicalUnicorns<<<3, 5>>>(x, y, somegpuptr);
cudaDeviceSynchronize();
MagicalUnicorns<<<3, 5>>>(x, y, somegpuptr);
cudaDeviceSynchronize();

Are there any performance gains between the two, is anything illegal, or will one just run after the other? I'm trying to make this imaginary function run as optimally as possible.

Thank you.

The best option would honestly be to just try it.

I doubt you will observe any difference unless you use multiple streams, though.
Worst case, the second version ends up taking longer, but it depends on whether or not you have to transfer data to or from the GPU.
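
If you want to measure it yourself, a minimal harness along these lines would do. The kernel body, sizes, and iteration count are all made up for illustration, and error checking is omitted:

// Hypothetical benchmark: trivial stand-in kernel, host wall-clock timing.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void MagicalUnicorns(int x, int y, int *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = x + y + idx;  // dummy work
}

int main()
{
    int *d_ptr;
    cudaMalloc(&d_ptr, 15 * sizeof(int));  // 3 blocks * 5 threads

    // warm-up launch so context creation doesn't skew the first timing
    MagicalUnicorns<<<3, 5>>>(1, 2, d_ptr);
    cudaDeviceSynchronize();

    const int iters = 1000;

    // Version 1: back-to-back launches, one sync at the end
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        MagicalUnicorns<<<3, 5>>>(1, 2, d_ptr);
        MagicalUnicorns<<<3, 5>>>(1, 2, d_ptr);
        cudaDeviceSynchronize();
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    // Version 2: sync after every launch
    for (int i = 0; i < iters; ++i) {
        MagicalUnicorns<<<3, 5>>>(1, 2, d_ptr);
        cudaDeviceSynchronize();
        MagicalUnicorns<<<3, 5>>>(1, 2, d_ptr);
        cudaDeviceSynchronize();
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("one sync : %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("two syncs: %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());

    cudaFree(d_ptr);
    return 0;
}

Looping many times amplifies the per-launch difference enough to see it over timer noise; a single pair of launches would be lost in measurement jitter.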

The first is at least slightly better. A kernel launch has 2 parts:

  1. latency (“overhead”)
  2. duration of kernel execution

When you don’t put a cudaDeviceSynchronize between the two back-to-back launches, the latency of the 2nd launch can be overlapped with the latency or duration of the first kernel launch. A kernel launch by itself is asynchronous, so control is returned to the CPU thread immediately after step 1 begins, meaning the second kernel launch can begin to “queue up” while the first launch is “queueing up” or “executing”. Stated another way, step 2 of kernel 2 cannot begin until step 2 of kernel 1 is complete, but step 1 of kernel 2 can begin while step 1 or 2 of kernel 1 is still in progress (as it is in this case).
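
You can see that asynchronicity directly by timing the launch call itself against the time to actual completion. This is a sketch with a made-up spin kernel, purely for illustration:

// Demonstration that a kernel launch returns to the host before the
// kernel finishes (hypothetical kernel, host-side timing only).
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void SpinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }  // busy-wait on the device
}

int main()
{
    SpinKernel<<<1, 1>>>(1000);  // warm-up / context creation
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    SpinKernel<<<1, 1>>>(100000000);  // roughly tens of ms on most GPUs
    auto t1 = std::chrono::high_resolution_clock::now();  // after step 1 only
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();  // after step 2

    printf("launch call returned after:      %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("kernel actually finished after:  %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t0).count());
    return 0;
}

The first number (step 1, the launch overhead) should come out in the microseconds, while the second (step 2, actual execution) is orders of magnitude larger.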

With the cudaDeviceSynchronize in between, the CPU thread is forced to halt at that line of code until step 1 and step 2 of the first kernel launch are complete. Only then can it proceed to the 2nd kernel launch, so there is no opportunity to “hide” the latency of the 2nd launch. Thus it will take (at least slightly) longer. The exact effect may vary based on launch parameters, number and size of kernel arguments, the underlying OS, the CUDA version, the GPU, what else is going on in the CUDA threads maintained by the driver, and the phase of the moon.

But the variability here is probably on the order of 0–50 µs.