How do I choose the grid size, i.e., the number of blocks and threads per block?

I want to understand whether there is any relation between the grid size (the number of blocks and the number of threads per block) and the hardware capability of the GPU.
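For reference, here is a minimal sketch (standard runtime API calls, error checking omitted) that prints the hardware limits I am asking about:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("SMs:               %d\n", prop.multiProcessorCount);
    printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads/SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    // Roughly, this many threads can be resident on the GPU at once:
    printf("max resident:      %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```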

How do I know when I have spawned so many threads that it starts hurting performance rather than helping?

If all threads of all blocks run in parallel, why does launching a bigger grid result in slower performance? I am talking about a simple kernel that just assigns a constant value to a local variable and returns; a sketch follows.
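Roughly, the kernel is equivalent to this sketch (identifier names are placeholders; note the compiler will likely eliminate the dead store, so the measurement is mostly launch and scheduling overhead):

```
__global__ void trivial() {
    int x = 42;  // assign a constant to a local variable
    (void)x;     // suppress unused-variable warning, then return
}
```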

All my measurements were taken with the environment variable CUDA_LAUNCH_BLOCKING=1.
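To show how I measure: with CUDA_LAUNCH_BLOCKING=1 the launch returns only after the kernel finishes, so a host-side timer around the launch captures the full run. A sketch (grid and block sizes are arbitrary examples):

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void trivial() { int x = 42; (void)x; }

int main() {
    trivial<<<1, 1>>>();        // warm-up launch (context creation etc.)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    trivial<<<65536, 256>>>();  // the "bigger grid" case
    cudaDeviceSynchronize();    // redundant with CUDA_LAUNCH_BLOCKING=1
    auto t1 = std::chrono::high_resolution_clock::now();

    printf("%lld us\n", (long long)std::chrono::duration_cast<
        std::chrono::microseconds>(t1 - t0).count());
    return 0;
}
```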

A GPU can run roughly 10-50K threads simultaneously. If you ask for more, it starts new threads as older ones finish. If you have a kernel with surprising performance, you can ask here instead of speculating about GPU internals.
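If you want a reasonable starting point rather than guessing, the runtime can suggest a block size for your kernel. A sketch, assuming a 1D problem of size n:

```
#include <cuda_runtime.h>

__global__ void kernel(int n) { /* ... */ }

void launch(int n) {
    int minGridSize = 0, blockSize = 0;
    // Suggests a block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;  // cover all n elements
    kernel<<<gridSize, blockSize>>>(n);
}
```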