Doing this in separate kernel launches adds the overhead of reading and writing global memory (hundreds of cycles per access). The best thing to do here is to compute everything in one kernel, keep temporary results in registers, and write the results just once.
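To make the difference concrete, here is a minimal sketch of both variants. The two-stage computation (scale, then add an offset) and all names (`scaleKernel`, `offsetKernel`, `fusedKernel`, `fusedMatchesSeparate`) are made up for illustration and are not from anyone's actual code. The split version pushes the intermediate value through a global buffer `tmp`, while the fused version keeps it in a register.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Split version, stage 1: the intermediate result goes out to global memory.
__global__ void scaleKernel(const float* in, float* tmp, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * in[i];
}

// Split version, stage 2: the intermediate result is read back from global memory.
__global__ void offsetKernel(const float* tmp, float* out, int n, float b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + b;
}

// Fused version: the temporary lives in a register, one global read and one write.
__global__ void fusedKernel(const float* in, float* out, int n, float a, float b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * in[i];  // stays in a register
        out[i] = t + b;       // single write of the final result
    }
}

// Hypothetical host driver: runs both variants and checks that they agree.
bool fusedMatchesSeparate(int n) {
    const float a = 2.0f, b = 1.0f;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> hIn(n), hSep(n), hFused(n);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i % 1000);

    float *dIn = nullptr, *dTmp = nullptr, *dSep = nullptr, *dFused = nullptr;
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dTmp, bytes) == cudaSuccess
           && cudaMalloc(&dSep, bytes) == cudaSuccess
           && cudaMalloc(&dFused, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;

        scaleKernel<<<blocks, threads>>>(dIn, dTmp, n, a);
        offsetKernel<<<blocks, threads>>>(dTmp, dSep, n, b);
        fusedKernel<<<blocks, threads>>>(dIn, dFused, n, a, b);

        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(hSep.data(), dSep, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(hFused.data(), dFused, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

        // A small tolerance allows for the fused kernel being compiled to an FMA.
        for (int i = 0; ok && i < n; ++i)
            ok = std::fabs(hSep[i] - hFused[i]) <= 1e-5f;
    }
    cudaFree(dIn); cudaFree(dTmp); cudaFree(dSep); cudaFree(dFused);
    return ok;
}
```

Calling `fusedMatchesSeparate(1 << 20)` from `main` should confirm that both paths give the same values. The split version does two extra global memory transfers per element and needs the extra `tmp` buffer. How much that costs in practice depends on the hardware and the real workload, so it is worth profiling rather than taking this sketch as a measurement.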
The memory access overhead is one reason why I chose the überkernel approach, but what if, in the end, the kernel contains something like 10,000 lines of code (with all calls inlined)?
The maximum kernel size (the limit is on the kernel, not on the thread) is 2,000,000 assembly instructions (I don’t think that changed with Fermi).
The thing you may need to watch is instruction cache pollution. You don’t want too much code inside an if branch where the threads of a block diverge, because that pollutes the instruction cache and can degrade performance. It can also cause problems if you have multiple blocks per multiprocessor and they take different code paths.
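One way to limit divergence inside a block is to pick the code path per block instead of per thread. The sketch below assumes a hypothetical per-block opcode array `blockOp` and two toy operations (`squareOp`, `negateOp`); none of these names come from the original code. Every thread in a block reads the same opcode, so the `switch` is uniform within the block.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical opcodes for the two paths of the überkernel.
enum Op { OP_SQUARE = 0, OP_NEGATE = 1 };

__device__ float squareOp(float x) { return x * x; }
__device__ float negateOp(float x) { return -x; }

// Überkernel sketch: the path is selected by blockIdx.x, not threadIdx.x,
// so all threads of a block follow the same branch.
__global__ void uberKernel(const int* blockOp, const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    float r;
    switch (blockOp[blockIdx.x]) {   // same value for the whole block
        case OP_SQUARE: r = squareOp(x); break;
        default:        r = negateOp(x); break;
    }
    out[i] = r;
}

// Hypothetical host driver: alternates opcodes between blocks and checks results.
bool uberKernelDispatchWorks() {
    const int threads = 256, blocks = 4, n = threads * blocks;
    std::vector<float> hIn(n), hOut(n);
    std::vector<int> hOp(blocks);
    for (int i = 0; i < n; ++i) hIn[i] = 0.5f * i;
    for (int b = 0; b < blocks; ++b) hOp[b] = (b % 2 == 0) ? OP_SQUARE : OP_NEGATE;

    float *dIn = nullptr, *dOut = nullptr;
    int* dOp = nullptr;
    bool ok = cudaMalloc(&dIn, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dOp, blocks * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dOp, hOp.data(), blocks * sizeof(int), cudaMemcpyHostToDevice);
        uberKernel<<<blocks, threads>>>(dOp, dIn, dOut, n);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) {
            float x = hIn[i];
            float expected = (hOp[i / threads] == OP_SQUARE) ? x * x : -x;
            ok = (hOut[i] == expected);  // inputs chosen so both results are exact
        }
    }
    cudaFree(dIn); cudaFree(dOut); cudaFree(dOp);
    return ok;
}
```

This only removes divergence within a block. As noted above, blocks that share a multiprocessor and run different paths can still compete for the instruction cache, so a layout like this is a starting point to measure, not a guaranteed fix.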
Whether it’s better to split into multiple kernels or use a single überkernel depends on your actual code. Going to global memory is very expensive, generally much more so than instruction cache pollution, but there are exceptions.