Second kernel run is faster than first run

alexca · September 27, 2016, 8:44am

Heya,

I discovered the following behaviour:
I have a kernel that takes a few hundred MB of input data in global memory. I have a function that sets this data up (cudaMalloc, cudaMemcpy) and then, after the kernel has finished, releases it with cudaFree.

However, if I call that function a second time (for the same data), both the kernel run and the initialization are much faster than in the first run. Why is that? In my opinion, since the data was freed after running the kernel for the first time, there shouldn’t be any noticeable speedup, because the whole cudaMalloc and cudaMemcpy sequence has to be done again.

Note: This only happens if I call my procedure twice in a program run. If I call it only once in the program but start the program twice in a row, both runs are “slow”.

tera · September 27, 2016, 9:21am

There are a couple of possible explanations, but the most significant slowdown on the first kernel invocation comes from the need to just-in-time compile PTX code to SASS instructions if no binary code for your GPU architecture is present. Make sure you include binary code for your architecture at compile time.
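To check this, something like the following sketch could reproduce the measurement: it times the whole malloc/copy/kernel/free sequence twice in one process with a host clock. The kernel `scale`, the helper `runOnce`, the 256 MB buffer size and the `sm_61` target in the build comment are all assumptions for illustration, not taken from the original code, and error checking is left out for brevity.

```cuda
// Build with SASS embedded for your GPU so no PTX JIT is needed, e.g.
// (sm_61 is only an assumed example, use your own architecture):
//   nvcc -gencode arch=compute_61,code=sm_61 first_run.cu -o first_run
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: multiplies every element by a factor.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Allocates, copies, runs the kernel, frees, and returns wall-clock ms.
static double runOnce(const std::vector<float> &host) {
    auto t0 = std::chrono::steady_clock::now();

    float *d = nullptr;
    size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice);

    unsigned blocks = (unsigned)((host.size() + 255) / 256);
    scale<<<blocks, 256>>>(d, host.size(), 2.0f);
    cudaDeviceSynchronize();  // wait so the kernel is included in the timing

    cudaFree(d);

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<float> host(64u << 20, 1.0f);  // 64 Mi floats = 256 MB
    printf("first call:  %.1f ms\n", runOnce(host));
    printf("second call: %.1f ms\n", runOnce(host));
    return 0;
}
```

If the gap between the two calls shrinks noticeably once the matching `-gencode` option is added, JIT compilation was a large part of the first-call cost; whatever gap remains has to come from other one-time setup.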

LongY · September 27, 2016, 3:06pm

Another possible explanation would be context initialization during the first function call. You can see this context initialization in NVVP (the NVIDIA Visual Profiler), which gives you a clear view of why the first function call is slower than the second one.
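Without a profiler, a rough way to separate this cost is to force context creation up front with a dummy call and time it on its own. The sketch below uses the common `cudaFree(0)` idiom for that; the helper `msSince` and the 256 MB allocation size are assumptions for illustration, and error checking is omitted.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: milliseconds elapsed since t0.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    // The first runtime call that needs a context creates it.
    // cudaFree(0) frees nothing, so its cost is essentially the setup.
    auto t0 = Clock::now();
    cudaFree(0);
    printf("context initialization: %.1f ms\n", msSince(t0));

    // With the context in place, the same kind of allocation the
    // original function performs can be timed in isolation.
    t0 = Clock::now();
    float *d = nullptr;
    cudaMalloc(&d, 256u << 20);  // 256 MB
    cudaFree(d);
    printf("cudaMalloc + cudaFree afterwards: %.1f ms\n", msSince(t0));
    return 0;
}
```

Calling `cudaFree(0)` once at program start moves the one-time cost out of the first timed call, which should make the first and second runs of the function comparable.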