speed of device vs. global functions

Good day! I am writing a program that does heavy computation in several global functions, which are called in a loop. Profiling with nvprof shows that launching the global functions takes much more time than the computations themselves. So my question is: would it be better to call several device functions from inside one global function, instead of launching several global functions from the host?

“After using nvprof it is shown that the launching of global functions takes much more time than computations”

if only nvprof could show you why this is…

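global functions are more flexible than device functions - you can not change the ‘environment’ of a device function (grid and block dimensions, shared memory size, stream); it inherits this from the kernel that calls it
and you can issue global functions ahead of time (queue them from the host without waiting for earlier work to finish), but not device functions, so global functions work very well with streams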
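to illustrate the point about streams, here is a minimal sketch, assuming two independent device buffers; the names `scale` and `scale_in_streams` are made up for this example and are not from the original program. each launch picks its own configuration and stream, and the host returns right away after queuing both:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element of data by factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Queues one launch per stream and returns without waiting for either.
// Each launch carries its own grid size, block size and stream, which is
// the "environment" a device function cannot choose for itself.
void scale_in_streams(float *d_a, float *d_b, int n,
                      cudaStream_t s0, cudaStream_t s1)
{
    assert(n > 0);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    scale<<<blocks, threads, 0, s0>>>(d_a, n, 2.0f);
    scale<<<blocks, threads, 0, s1>>>(d_b, n, 3.0f);
}
```

the caller would create the streams with `cudaStreamCreate` and call `cudaStreamSynchronize` (or `cudaDeviceSynchronize`) before reading the results back.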

on the other hand, device functions should increase readability and device autonomy

by making sure your memory copies are asynchronous (using pinned memory where required, etc.), increasing the work done per kernel (via its grid and block dimensions), and issuing as much work ahead of time as possible, you could hide the kernel launch overhead better
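a rough sketch of those three points together, assuming a single buffer of floats; `add_one` and `process_async` are hypothetical names, and the 64 x 256 launch size is an arbitrary choice for the example, not a tuned value:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel using a grid-stride loop: each thread handles several
// elements, so a single launch covers all n elements for any grid size.
__global__ void add_one(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;
}

// Queues copy-in, kernel and copy-out on one stream and returns immediately.
// h_pinned is assumed to come from cudaMallocHost (or cudaHostAlloc); with
// ordinary pageable memory the copies may not be truly asynchronous.
void process_async(float *h_pinned, float *d_buf, int n, cudaStream_t stream)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<64, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    // no synchronization here, so the host is free to queue more work
}
```

the host only needs to call `cudaStreamSynchronize(stream)` at the point where it actually reads `h_pinned` again.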

otherwise, device functions are perhaps the best choice
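for the device function route, a minimal sketch could look like the following. `step_a`, `step_b`, `fused` and `run_fused` are placeholder names standing in for the real computations, and the sketch assumes each element can be processed independently of the others:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical steps that would otherwise each be a separate kernel.
__device__ float step_a(float x) { return x * 2.0f; }
__device__ float step_b(float x) { return x + 1.0f; }

// One kernel runs the whole loop on the device, so the host pays for a
// single launch instead of two launches per iteration.
__global__ void fused(float *data, int n, int iterations)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];  // keep the value in a register between steps
    for (int it = 0; it < iterations; ++it) {
        x = step_a(x);
        x = step_b(x);
    }
    data[i] = x;
}

// Host-side wrapper: one launch covers all iterations.
void run_fused(float *d_data, int n, int iterations)
{
    assert(n > 0);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    fused<<<blocks, threads>>>(d_data, n, iterations);
}
```

note that this only works as written when a step does not need results produced by other blocks in a previous step; separate kernel launches give you a grid-wide synchronization point between steps, and a fused kernel does not.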