Why does parallelization make GPU utilization lower?

I wrote a simple CUDA program to check whether parallelization can boost GPU utilization (compiled with --default-stream per-thread):

#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    while (1)
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);
    }
}
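For completeness, a minimal self-contained version of this pattern looks roughly like the following. The real dgt_mul body, sizes, and launch arguments are elided in my snippet above, so a placeholder kernel, placeholder sizes, and an assumed build line are used here:

// Minimal self-contained sketch of the pattern above (placeholder kernel and
// sizes; not the real dgt_mul). Assumed build line:
//   nvcc -Xcompiler -fopenmp --default-stream per-thread repro.cu -o repro
#include <cstdio>
#include <omp.h>

__global__ void dgt_mul(float *out, const float *a, const float *b, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = a[idx] * b[idx];
}

int main()
{
    const int n = 1 << 20;
    const int num_threads = 3;               // compare 2 vs. 3 here
    float *a, *b, *out;
    cudaMalloc((void **)&a,   n * sizeof(float));
    cudaMalloc((void **)&b,   n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));

    dim3 bDim(256);
    dim3 gDim((n + bDim.x - 1) / bDim.x);

    #pragma omp parallel for num_threads(num_threads)
    for (int i = 0; i < num_threads; i++)
    {
        // With --default-stream per-thread, each host thread launches into
        // its own default stream, so the kernels can overlap on the GPU.
        for (int j = 0; j < 1000; j++)
            dgt_mul<<<gDim, bDim>>>(out, a, b, n);
    }

    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(out);
    printf("done\n");
    return 0;
}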

When the program spawns 2 threads, GPU utilization can reach more than 50%; but with 3 threads, utilization drops to ~30%.

I limited the iteration count and profiled it:

cudaProfilerStart();
#pragma omp parallel for
for (int i = 0; i < array_size; i++)
{
    for (int j = 0; j < 1000; j++)
    {
        dgt_mul<<<gDim, bDim, 0, st>>>(......);
    }
}
cudaProfilerStop();

The following result is with 2 threads:
https://i.stack.imgur.com/vvcqz.jpg

And this is with 3 threads:
https://i.stack.imgur.com/piYVS.jpg

In the 2-thread case, the parallelization looks fine, while in the 3-thread case execution becomes nearly serial. I am not sure whether this is because cudaLaunchKernel becomes the bottleneck.
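One way I could check this (a rough sketch, not my original program; dgt_mul is replaced by a trivial placeholder kernel) is to time only the host-side launch calls inside each OpenMP thread and see whether the average cost per launch grows with the number of threads:

#include <cstdio>
#include <omp.h>

__global__ void dummy_kernel(float *out, int n)       // placeholder, not the real dgt_mul
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] += 1.0f;
}

int main()
{
    const int n = 1 << 20, launches = 1000;
    float *out;
    cudaMalloc((void **)&out, n * sizeof(float));
    dim3 bDim(256), gDim((n + bDim.x - 1) / bDim.x);

    #pragma omp parallel num_threads(3)                // try 2 vs. 3 threads
    {
        double t0 = omp_get_wtime();
        for (int j = 0; j < launches; j++)
            dummy_kernel<<<gDim, bDim>>>(out, n);      // per-thread default stream
        double t1 = omp_get_wtime();
        // Launches are asynchronous, so this mostly measures the host-side
        // launch path; if the launch queue backs up, the launch calls
        // themselves stall and the measured per-launch time grows.
        printf("thread %d: %.1f us per launch\n",
               omp_get_thread_num(), (t1 - t0) * 1e6 / launches);
    }

    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}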

Could anyone give me some clues about this phenomenon? Thanks very much in advance!

P.S. This issue was originally posted at https://stackoverflow.com/questions/55177474/why-does-the-parallelization-make-gpu-utilization-become-lower, but no one has answered, so I am reposting it here. Thanks!

The reason should be that the kernel launch queue becomes the bottleneck (please see my post: Parallelization may cause GPU utilization become worse | Nan Xiao's Blog).
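If the algorithm allows it, one way around a launch-queue bottleneck is to do more work per launch instead of issuing many tiny launches, e.g. a grid-stride kernel that folds the inner iterations into a single launch. A hypothetical batched variant (not the real dgt_mul) to illustrate the idea:

__global__ void dgt_mul_batched(float *out, const float *a, const float *b,
                                int n, int reps)
{
    // Grid-stride loop: one launch covers the whole array, and 'reps' folds
    // what used to be many separate launches into the kernel itself.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += gridDim.x * blockDim.x)
    {
        float v = out[idx];
        for (int r = 0; r < reps; r++)
            v += a[idx] * b[idx];
        out[idx] = v;
    }
}

// Launched once per thread instead of 1000 times:
// dgt_mul_batched<<<gDim, bDim>>>(out, a, b, n, 1000);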