Deep dive into concurrent kernel launches

  1. How does a GPU determine whether two kernels “A” and “B” can run concurrently (assuming they are computationally independent)? That is, what resource(s) does the GPU inspect when deciding whether to allow concurrent kernel launches (e.g. SMs, required shared memory, …)? I know they don’t look at global memory, since what often happens is that two kernels end up trying to allocate more than the available DRAM and a fatal error is raised.

  2. For example, let’s assume process A requires 6 SMs, process B requires 15 SMs, and we only have a GTX 1080, which has 20 SMs. However, because process B was very poorly designed, only 10 of its SMs do computation while the other 5 wait idly for those 10 to finish. If we first launch process A and then attempt to concurrently launch process B, is our GPU smart enough to realize that there is no computation in the blocks occupying those 5 SMs, and therefore allow both processes to run concurrently?

  3. For nested kernel launches: if process A tries to launch a process B, how does the GPU determine whether this is viable? Is the same scheme applied as in the answer to question 1?

  1. The thought process here is similar to that of occupancy. I suggest you study occupancy and how it determines and limits the number of blocks that can be simultaneously resident on a SM. This sort of capacity consideration is one of the requirements for kernel concurrency: the availability of “room” on a SM for more blocks to be scheduled is one of the factors that determine whether kernel concurrency is possible.
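If you want to poke at this yourself, the runtime exposes an occupancy query. Here is a minimal sketch (the kernel, the 256-thread block size, and device 0 are assumptions for illustration) that reports how many blocks of a given kernel can be co-resident on one SM - exactly the kind of “room” the block scheduler has to find before another kernel’s blocks can be deposited:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, used only as the subject of the occupancy query.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;
}

int main()
{
    int blocksPerSM = 0;
    int threadsPerBlock = 256;     // assumed launch configuration
    size_t dynamicSmemBytes = 0;   // no dynamic shared memory in this sketch

    // Ask the runtime how many blocks of myKernel can be resident on one SM,
    // given its register/shared-memory usage and this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  threadsPerBlock,
                                                  dynamicSmemBytes);

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    printf("Up to %d blocks per SM, %d SMs -> %d resident blocks device-wide\n",
           blocksPerSM, numSMs, blocksPerSM * numSMs);
    return 0;
}
```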

  2. A process can’t “require” a certain number of SMs. That’s not how the CUDA execution model works. A block (that is scheduled on a SM) uses a relatively fixed set of resources (registers per thread times number of threads, static + dynamic shared memory allocated, block slots, warp slots, etc.) regardless of what it is doing or not doing.
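To see that fixed per-block footprint concretely, you can query a kernel’s compile-time resource usage with cudaFuncGetAttributes. A small sketch (the kernel itself is made up for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel; the point is querying its fixed per-block footprint.
__global__ void myKernel(float *data)
{
    __shared__ float tile[128];    // static shared memory, fixed per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x] + 1.0f;
}

int main()
{
    cudaFuncAttributes attr;
    // These costs are fixed by the compiler; they do not depend on what the
    // threads end up doing (or not doing) at run time.
    cudaFuncGetAttributes(&attr, myKernel);

    printf("registers per thread   : %d\n", attr.numRegs);
    printf("static shared mem/block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block  : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```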

  3. For nested kernel launches i.e. CUDA Dynamic Parallelism (CDP), the number of nested launches outstanding adheres to a specific limit. An outstanding launch does not necessarily mean it is executing - i.e. it does not necessarily mean that the GPU block scheduler has scheduled one or more of its blocks on specific SMs. The GPU will go to special lengths to ensure the completion of child kernel launches, so that parent kernels that launched them (and are therefore dependent on their completion) can also complete. This includes the possibility of preemption - the removal of a block executing on a SM to make room for a child kernel block. Preemption is not typical on GPUs but does happen under some circumstances, one of those being CDP. I suggest you read the CDP section in the programming guide.
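For illustration only, a minimal CDP sketch (kernel names are invented; compile with relocatable device code, e.g. nvcc -rdc=true) might look like the following. The parent kernel is not considered complete until its child grids complete, and the number of outstanding device-side launches is bounded by the cudaLimitDevRuntimePendingLaunchCount limit:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void childKernel(int parentBlock)
{
    if (threadIdx.x == 0)
        printf("child launched by parent block %d\n", parentBlock);
}

__global__ void parentKernel()
{
    // Nested (device-side) launch: an outstanding launch that the block
    // scheduler may or may not have placed on SMs yet.
    if (threadIdx.x == 0)
        childKernel<<<1, 32>>>(blockIdx.x);
    // The parent grid cannot complete until all of its children complete.
}

int main()
{
    // Optionally raise the limit on outstanding device-side launches
    // (the default is implementation-defined; 4096 here is just an example).
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    parentKernel<<<4, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```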

My own opinions:
In practice, kernel concurrency is hard to witness. It requires a carefully controlled set of conditions which are not typical of efficient CUDA kernel launches. I consider aiming for kernel concurrency to be mostly a misguided idea and a fool’s errand, unless you are well beyond the exploratory stages of CUDA programming, and have the concepts you are asking about mastered. Even then, designing for kernel concurrency only makes sense in certain kinds of work-issuance scenarios.
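For what it’s worth, if you do want to try to witness concurrency, the usual recipe is small, long-running, independent kernels issued into separate non-default streams, something like the sketch below (kernel, grid sizes, and iteration count are made up; actual overlap is not guaranteed and needs to be confirmed with a profiler):

```cpp
#include <cuda_runtime.h>

__global__ void busyKernel(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 10000; ++k) {  // artificially long-running work
            p[i] = p[i] * 1.000001f + 0.5f;
        }
    }
}

int main()
{
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids (few blocks) so that both kernels can be resident at once.
    busyKernel<<<4, 256, 0, s1>>>(a, n);
    busyKernel<<<4, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Even with this setup, if either launch fills the machine there is no room left for the other, which is why concurrency is rarely observed with efficiently sized kernels.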

I second those opinions.

Thanks for the helpful answers, njuffa and Robert_Crovella.

So the gist seems to be: focus on how to utilize all the GPU’s resources efficiently with a single kernel launch instead of trying to have multiple kernels up at the same time.