What are the types of parallelism on a GPU?

Hello, all!
I just started in this area and have read some materials on GPU computing.
I'm really confused by the concept of parallelism on GPUs.

The following are some of my questions:

What is the type of parallelism within one warp?
Is it pipelined, or fully parallel, in other words, does each core execute one thread?

What is the type of parallelism among multiple warps on one Streaming Multiprocessor?
Is it just time sharing? For example, with 16 warps on a Streaming Multiprocessor, Warp 1 uses the first second, Warp 2 uses the second second, Warp 3 uses the third second, and so on; after one round, Warp 1 uses the 17th second, Warp 2 uses the 18th second, and so on. In that case, is there only one warp executing on the Streaming Multiprocessor at any instant?

Or do they use pipeline parallelism?

Thank you all in advance!

It depends on the hardware. The G80 executed a warp in 4 clock ticks, for example. In general, the programmer should ignore such details and just assume that the entire warp is executed in parallel.
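
For intuition, here is a minimal sketch of how a thread can work out which warp and lane it belongs to. The kernel and array names are made up for illustration; warpSize is the built-in CUDA device variable (32 on current hardware).

```
// Minimal sketch: how a thread finds its warp and lane within a block.
__global__ void whichWarp(int *warpIdOut, int *laneIdOut)
{
    int tid  = threadIdx.x;       // thread index within the block
    int warp = tid / warpSize;    // which warp of the block this thread is in
    int lane = tid % warpSize;    // lane 0..31 inside that warp

    warpIdOut[tid] = warp;        // all 32 lanes of a warp issue this store
    laneIdOut[tid] = lane;        // as a single warp-wide instruction
}
```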

It is time sharing, though much more capable than the round-robin you describe. The SM can put warps to sleep while they wait for memory operands to arrive and execute the warps that are ready. Context switching has zero overhead, since there are enough registers for all resident threads.
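
To make the latency-hiding idea concrete, here is a toy sketch: a memory-bound kernel launched with far more threads than the SM has cores, so the scheduler always has ready warps to issue while other warps wait on their loads. The kernel name, array names, and launch sizes are illustrative only.

```
// Each thread does one load, one multiply, one store. While a warp waits
// for its loads, the scheduler issues instructions from other warps.
__global__ void scaleArray(const float *in, float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * s;
}

// Host side: launch far more threads than there are CUDA cores.
// int threads = 256;                          // 8 warps per block
// int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
// scaleArray<<<blocks, threads>>>(d_in, d_out, n, 2.0f);
```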

Thank you so much for your explanation!

But I still don't understand your answer to the first one. What do you mean by "executed a warp in 4 clock ticks"? Is it executed in a 4-stage pipelined fashion?

Also, what does the SIMD width mean physically? For example, the 32 in CUDA.

I know a Streaming Multiprocessor has 8 scalar processors. So does a 4-stage pipeline result in the 4*8 = 32 SIMD width?

Can I say that there are at most 32 threads executing on a Streaming Multiprocessor at any time instant, ignoring the time-sharing parallelism?

The pipeline is much deeper than that. Compute capability 2.x GPUs have an approximately 20-stage pipeline per CUDA core, and compute capability 3.0 GPUs have approximately 10 stages. The warp scheduler in a multiprocessor can fill the pipelines with instructions from any combination of the available warps. As a result, the number of threads executing at any time instant is very large and somewhat hard to define. There is no time-slice-based context switching between threads like on a multicore CPU.
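
If you want the actual numbers for your own card, you can query them with the CUDA runtime. This is just a small standalone sketch using the documented cudaDeviceProp fields:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Compute capability    : %d.%d\n", prop.major, prop.minor);
    printf("Warp size             : %d\n", prop.warpSize);
    printf("Multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
    printf("Max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads  : %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

The last number is how many threads can be resident (and schedulable) at once, which is much larger than the number of CUDA cores.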

The SIMD aspect of CUDA comes from the fact that the warp scheduler issues a single instruction for a group of 32 threads at a time. In older GPUs, the 32-wide SIMD instruction was dispatched to 8 pipelines; in newer GPUs the same 32-wide SIMD instruction is dispatched to 16 or 32 pipelines. Given the overall length of the pipeline, these minor differences don't affect instruction latency significantly.
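
One way to see that 32-wide issue from code is a warp shuffle reduction, where every lane of a warp executes the same instruction in lockstep and lanes exchange data directly. This is only a sketch: the kernel and buffer names are invented, it assumes a single block whose size is a multiple of 32, and the *_sync shuffle intrinsics require CUDA 9+ and compute capability 3.0 or higher.

```
// Sum the values held by the 32 lanes of each warp.
__global__ void warpSum(const int *in, int *out)
{
    int lane = threadIdx.x % warpSize;   // lane 0..31 within the warp
    int val  = in[threadIdx.x];

    // Each iteration is one warp-wide instruction: every lane reads the value
    // from the lane 'offset' positions above it and adds it in.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0)                            // lane 0 now holds the warp's sum
        out[threadIdx.x / warpSize] = val;    // one result per warp
}
```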