Why is there no need to store state during thread context switching?

First of all: is there context switching during the execution of a thread, or does a thread start and finish its code before another thread takes its resources?

In case there is context switching, let's say I have the following function:

__global__ void myfunct(int* array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = (idx + 1) * 2;
}

Let's say that a thread executing myfunct loses its processor after it has calculated "idx+1" (but it hasn't done the "*2" part yet). Where is the result of "idx+1" stored? Doesn't that generate overhead? I mean, that would be storing state, which supposedly CUDA doesn't do.

There is context switching: any time a warp scheduler switches from issuing one warp to another. This warp-switching behavior is desirable, as it is part of the latency-hiding process.

A large part of a thread's state is contained in its associated registers (there is only a small amount of additional state, such as the instruction pointer, stack pointer, and flags). A thread's register state does not need to be moved during a context switch because there is enough register storage for all threads/warps that are currently resident and schedulable for execution. This is one of the reasons why GPUs have relatively large register files (e.g. 16K+ registers per SM). Since there is a unique (physical/HW) register for each logical register of each (schedulable) thread, no movement of register data is required during a "context switch".
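
To see the size of that per-thread register footprint for a particular kernel, you can ask the runtime for the attributes the compiler recorded. A minimal sketch, reusing the myfunct kernel from the question (the same number is also reported at compile time by nvcc -Xptxas -v):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myfunct(int* array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = (idx + 1) * 2;
}

int main()
{
    // Attributes the compiler recorded for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myfunct);

    // numRegs is the register footprint of ONE thread. Every resident
    // thread gets its own physical copy of these registers in the SM
    // register file, which is why nothing needs to be saved or restored
    // when the warp scheduler switches warps.
    printf("registers per thread: %d\n", attr.numRegs);
    return 0;
}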

This is also why register usage can be a limiter on "occupancy". In general, we want as many warps/threadblocks as possible to be "resident", i.e. schedulable for execution, as this increases our ability to hide latency. But each additional schedulable warp carries with it a register "footprint" which must be provided for in the SM register file. When we run out of register storage, no more warps/threadblocks can be scheduled on that SM until some of the register usage is released by retiring warps/threadblocks. Therefore, imposing limits on register usage per thread may help to drive up "occupancy", i.e. the number of schedulable warps/threadblocks that can be simultaneously resident on the SM. Since this helps with latency hiding, it may be a good thing. On the other hand, limiting register usage may lead to register "spilling" by the compiler, which will increase memory traffic. So there may be a tradeoff to this method of increasing "occupancy".
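
One common way to impose such a limit is the __launch_bounds__ qualifier (the -maxrregcount nvcc option is the file-wide equivalent). A sketch, again using the kernel from the question; the 256/4 values are just example numbers:

// Tell the compiler this kernel will be launched with at most 256
// threads per block, and that we would like at least 4 blocks to be
// resident per SM. The compiler then tries to cap register usage per
// thread accordingly, spilling to local memory if it has to.
__global__ void __launch_bounds__(256, 4) myfunct(int* array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = (idx + 1) * 2;
}

Whether the extra occupancy outweighs the cost of any spilling is workload-dependent, which is exactly the tradeoff described above.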

As a clarification to the above, warps cannot be individually made resident on an SM. The unit of residency on an SM is the threadblock, not the warp. No warps from a threadblock will become resident on an SM until there are sufficient resources for the entire threadblock. At that point, the entire threadblock (all of its warps) may become resident on that SM.
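
Consistent with that, the occupancy API reports residency in units of blocks per SM, not warps. A sketch assuming the myfunct kernel above and an example block size of 256:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myfunct(int* array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = (idx + 1) * 2;
}

int main()
{
    int blockSize = 256;     // example threads per block
    int blocksPerSM = 0;

    // The runtime reports how many *blocks* of this kernel can be
    // simultaneously resident on one SM, given its register and
    // shared-memory footprint -- the threadblock, not the warp, is
    // the unit of residency.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myfunct,
                                                  blockSize, 0);

    printf("resident blocks per SM: %d (= %d warps)\n",
           blocksPerSM, blocksPerSM * blockSize / 32);
    return 0;
}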