I want to understand whether there is any relation between the grid size, the number of blocks, the number of threads per block, and the hardware capability of the GPU.
How do I know when I have spawned so many threads that it starts hurting performance rather than helping?
If all threads of all blocks run in parallel, why does launching a bigger grid result in slower performance? I am talking about a simple kernel that just assigns a constant value to a local variable and returns.
All my measurements were taken with the environment variable CUDA_LAUNCH_BLOCKING=1 set.
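For reference, here is a minimal sketch of the kind of kernel and measurement I am describing. The kernel body and the grid/block sizes are placeholders I chose for illustration, not my exact benchmark code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread assigns a constant to a local variable
// and returns. The compiler may optimize the body away entirely, so
// the measured time is dominated by launch overhead and scheduling,
// not by any real work.
__global__ void trivialKernel()
{
    int local = 42;  // assigned and never used
    (void)local;
}

int main()
{
    // Placeholder sizes: vary the grid dimension to compare timings.
    dim3 block(256);
    dim3 smallGrid(64), bigGrid(65536);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so context initialization is excluded from timing.
    trivialKernel<<<smallGrid, block>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    trivialKernel<<<bigGrid, block>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("big grid launch took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

With a setup like this, the bigger grid still takes measurably longer even though the kernel does nothing, which is the behavior I am asking about.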