CUDA threads and warps

Hello all,

I am new to GPU programming. I have some questions regarding basic concepts of CUDA and GPU hardware.

When we launch threads, at runtime these threads are again divided into warps. I am quite confused about how the GPU executes instructions. Does a single thread execute one instruction, or does a group of threads (a warp) execute one instruction? And how are the CUDA cores involved in the process when executing an instruction?

Also, I am using a Jetson TK1, and I have read that it has only one SM. So how many blocks does that SM have?

Thank you

Okay, so I’m new at this too, but I’d like to try to help as much as I can.

The Jetson TK1 has a single Kepler SM (192 CUDA cores).

When you launch your kernel, the GPU will map each block onto that SM. The scheduling is done automatically. I am not sure if all threads in a block must complete execution before another block can run on the SM. I would assume all threads in the block must complete before another block is assigned to the SM, because if another block were scheduled to execute on the SM, then the unfinished threads in that first block could not be executing.

So inside each block that is executing on the SM, the block is split up into warps (groups of 32 threads). Now, I know that if the threads in a warp must wait on a thread from another warp to finish, then the warp will be “context switched” with another warp. These warps are swapped back and forth, and the GPU does all of this scheduling for you.
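To make the split into warps concrete, here is a small host-side sketch in plain C++ (no GPU needed). It only models the arithmetic: the function names are made up for illustration, and it assumes a warp width of 32 and a one-dimensional (linearized) thread index.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp width

// Number of warps needed to hold a block; a partly filled last warp still counts.
int warps_in_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Which warp a linearized thread index falls into.
int warp_id(int thread_idx) {
    assert(thread_idx >= 0);
    return thread_idx / kWarpSize;
}

// Position of the thread inside its warp (0..31).
int lane_id(int thread_idx) {
    assert(thread_idx >= 0);
    return thread_idx % kWarpSize;
}
```

Under these assumptions, a block of 192 threads is 6 warps, a block of 1024 threads is 32 warps, and thread 70 sits in warp 2 at lane 6. A block of 100 threads still occupies 4 warps, with the last one only partly filled.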

Each thread is mapped to a single CUDA core, and the instructions that thread performs are the operations defined inside the kernel's body. This is the concept of SIMD. The same instructions (the kernel) are executed on MULTIPLE DATA (the input arguments to the kernel, as seen by that specific thread running on one single CUDA core).
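As a rough illustration of "same instruction, multiple data" at warp granularity, here is a toy host-side model in plain C++. It is only a sketch of the idea, not how the hardware is implemented: `warp_add` and `first_lanes_mask` are hypothetical names, one "instruction" is issued once for a whole warp of 32 lanes, and an active mask decides which lanes take part.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // assumed warp width

// Mask with the first n lanes active, e.g. for a partly filled last warp.
std::uint32_t first_lanes_mask(int n) {
    assert(n >= 0 && n <= kWarpSize);
    return n == kWarpSize ? 0xFFFFFFFFu : ((1u << n) - 1u);
}

// One "instruction" (add an immediate) issued once for the whole warp.
// Each active lane applies it to its own register; inactive lanes sit idle.
void warp_add(std::array<int, kWarpSize>& regs, int imm, std::uint32_t active_mask) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if ((active_mask >> lane) & 1u) {
            regs[lane] += imm;
        }
    }
}
```

If each lane starts with its own lane number and `warp_add(regs, 100, first_lanes_mask(16))` is called, lanes 0 to 15 end up with 100 to 115 while lanes 16 to 31 keep their old values. The loop here stands in for what the lanes would do side by side.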

Does that help at all?

Thank you very much for the info.

Every thread in a kernel is mapped to a specific core, and the thread will use the resources of that CUDA core. If my kernel has 192 threads, the 192 threads will be mapped to the 192 CUDA cores (single Kepler SM). On a compute capability 3.x GPU, there can be a maximum of 1024 threads per block. So what happens to the excess threads (1024 - 192)?

Please let me know if my understanding is wrong.

As long as the first 192 threads have some work to do, they will keep running. As soon as one of the 192 currently running threads needs to wait, for example on the load/store units (there are just 32), another thread will be started, and the waiting one is simply suspended. The reason for the limit on concurrently running threads is that each thread's context needs to be stored until the thread has completed. Since there is no infinite amount of space to store the thread data, the number is limited. There is no guarantee that threads 0-191 will finish first. The order can be arbitrary, depending on how the threads can be scheduled most efficiently.
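To put numbers on that limit, here is a small host-side C++ sketch. It assumes the per-SM limits published for compute capability 3.x (at most 1024 threads per block, 64 resident warps, and 16 resident blocks per SM) and ignores register and shared memory usage, which can lower the result further. The function names are made up for illustration; for a real device, check the limits reported by the CUDA runtime.

```cpp
#include <algorithm>
#include <cassert>

// Assumed limits for compute capability 3.x; verify them for your device.
constexpr int kWarpSize = 32;
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxResidentWarpsPerSM = 64;   // i.e. 2048 threads
constexpr int kMaxResidentBlocksPerSM = 16;

// How many blocks of this size can be resident on one SM at the same time,
// looking only at the warp and block limits.
int resident_blocks_per_sm(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= kMaxThreadsPerBlock);
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    return std::min(kMaxResidentBlocksPerSM,
                    kMaxResidentWarpsPerSM / warps_per_block);
}

// Total threads whose context is held on the SM for that block size.
int resident_threads_per_sm(int threads_per_block) {
    return resident_blocks_per_sm(threads_per_block) * threads_per_block;
}
```

With these assumed limits, two blocks of 1024 threads fit on one SM, or ten blocks of 192 threads (1920 threads). So the thread count held on the SM can be well above the 192 cores, and the threads in excess of the core count wait as resident warps until they are picked to issue.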

Always keep in mind that new threads are only started if a whole warp is suspended. If one of the 32 threads is working, the other 31 thread slots are still blocked by the threads that are grouped into a warp with the working one.
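The warp-level switching described above can be sketched with a toy model in plain C++. This is a deliberately simplified assumption, not the real scheduler: there is one issue slot per cycle, a warp that has just issued an instruction is treated as waiting for `latency` cycles, and the lowest-numbered ready warp is picked. `cycles_to_finish` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Toy model: counts the cycles until every warp has issued all its instructions.
int cycles_to_finish(int num_warps, int instrs_per_warp, int latency) {
    assert(num_warps >= 0 && instrs_per_warp >= 0 && latency >= 1);
    std::vector<int> remaining(num_warps, instrs_per_warp);
    std::vector<int> ready_at(num_warps, 0);  // cycle at which each warp may issue again
    int left = num_warps * instrs_per_warp;
    int cycle = 0;
    while (left > 0) {
        for (int w = 0; w < num_warps; ++w) {
            if (remaining[w] > 0 && ready_at[w] <= cycle) {
                --remaining[w];
                --left;
                ready_at[w] = cycle + latency;  // the whole warp waits together
                break;                          // only one issue per cycle
            }
        }
        ++cycle;  // if no warp was ready, the issue slot stays empty this cycle
    }
    return cycle;
}
```

In this model, one warp with 4 instructions and a latency of 4 needs 13 cycles, because the issue slot is empty most of the time. Four such warps need only 16 cycles for four times the work, since another warp is ready whenever one is waiting. That is the reason for keeping more warps resident than there are cores to run them.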