GPU architecture and warp scheduling

a.p.sexton · February 8, 2018, 10:16am

From my reading, especially of the appendices in the CUDA C Programming Guide, and adding some assumptions that seem plausible but that I could not verify, I have come to the following understanding of GPU architecture and warp scheduling.

I would really appreciate it if someone with more expertise could read this and comment on any misunderstandings they see:

============
GTX 960
CUDA capability 5.2
8 multiprocessors, 128 cores/multiproc, 4 warp schedulers per multiproc
Max 2048 threads per multiproc
Max 1024 threads per block
GPU max clock rate: 1.29GHz

Blocks are assigned to a multiproc

Thus, with 1024 threads per block, 2 blocks can be live (“in flight”) on a multiproc. More if you have fewer threads per block.

When a block is assigned to a multiproc, the warps of the block are distributed statically among the 4 warp schedulers.

With max 2048 threads per multiproc, i.e. 64 warps, each scheduler gets at most 16 warps (possibly from different blocks).
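The arithmetic above can be written out as a short sketch. The function and parameter names are hypothetical, and it models only the thread-count limits quoted in this post. It ignores the register, shared-memory and resident-block limits that also cap occupancy on real hardware, and the even split across schedulers is an assumption.

```python
WARP_SIZE = 32

def resident_warps(threads_per_block, max_threads_per_sm=2048, schedulers_per_sm=4):
    """Return (blocks per SM, warps per SM, warps per scheduler).

    Only the thread-count limit is modelled; registers, shared memory and
    the resident-block limit are ignored (assumption for illustration).
    """
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    max_warps_per_sm = max_threads_per_sm // WARP_SIZE
    # Only whole blocks can be resident on a multiprocessor.
    blocks_per_sm = max_warps_per_sm // warps_per_block
    warps_per_sm = blocks_per_sm * warps_per_block
    # Assumption: warps are spread evenly over the schedulers.
    warps_per_scheduler = -(-warps_per_sm // schedulers_per_sm)
    return blocks_per_sm, warps_per_sm, warps_per_scheduler

if __name__ == "__main__":
    for tpb in (1024, 768, 256):
        print(tpb, resident_warps(tpb))
```

Note that a block size such as 768 leaves warp slots unused in this model, because a third block no longer fits under the 2048-thread limit.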

Each scheduler issues 1 instruction from one warp per cycle, if it has a warp ready to execute. So if the current warp is not ready (e.g. waiting on a memory transaction or an FP function unit), the scheduler switches, without any delay, to another ready warp assigned to it, if such a warp exists.

The advantage of each scheduler having up to 16 warps to schedule is that you can cover quite a bit of latency while waiting for a delayed instruction to complete (mem, function unit etc.) by switching between the other 15 warps.
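A toy model of this latency hiding, as a sketch only. It assumes each warp is a chain of dependent instructions, that a warp must wait a fixed number of cycles after issuing before it can issue again, and that the scheduler issues at most one instruction per cycle. The function name and the latency figures are made up for illustration and are not hardware numbers.

```python
def issue_utilization(warps, latency_cycles):
    """Fraction of cycles in which one scheduler can issue (toy model).

    Each warp can issue once every `latency_cycles` cycles, so `warps`
    warps together offer at most warps / latency_cycles instructions per
    cycle, capped at the scheduler's single issue slot.
    """
    if warps <= 0 or latency_cycles <= 0:
        raise ValueError("warps and latency_cycles must be positive")
    return min(1.0, warps / latency_cycles)

if __name__ == "__main__":
    for latency in (8, 16, 64):
        print(latency, issue_utilization(16, latency))
```

In this model 16 warps keep the scheduler fully busy as long as the latency is at most 16 cycles. Beyond that the issue rate drops in proportion, unless each warp also has independent instructions to offer.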

If a scheduler does not have a warp ready to execute, it cannot “steal” a ready warp from another scheduler.

Thus in the theoretical optimal case, if you have a kernel with 32*N threads in total (i.e. N warps), and the kernel is K instructions long, you could potentially execute the kernel in N * K / (8 * 4) cycles at maximum 1.29 GHz (8 multiprocessors, 4 warp schedulers each, so 32 warp instructions issued per cycle).

=============
GTX 1080 Ti
CUDA capability 6.1
28 multiprocessors, 128 cores/multiproc, 4 warp schedulers per multiproc
Max 2048 threads per multiproc
Max 1024 threads per block
GPU max clock rate: 1.68GHz

Same assumptions: N * K / (28 * 4) cycles at maximum 1.68 GHz
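The two estimates above can be computed with a small sketch. The function name and the example kernel size are hypothetical. It assumes, as in this post, one warp instruction per scheduler per cycle with no stalls, so the result is a lower bound on issue cycles, not a performance prediction.

```python
WARP_SIZE = 32

def ideal_kernel_time(total_threads, instructions_per_thread,
                      num_sms, schedulers_per_sm, clock_hz):
    """Return (cycles, seconds) for the idealised N * K / (SMs * schedulers) bound."""
    # N: number of warps, rounding a partial warp up to a full one.
    warps = -(-total_threads // WARP_SIZE)
    # Warp instructions the whole GPU can issue per cycle in this model.
    issue_slots_per_cycle = num_sms * schedulers_per_sm
    cycles = warps * instructions_per_thread / issue_slots_per_cycle
    return cycles, cycles / clock_hz

if __name__ == "__main__":
    # Hypothetical kernel: 2**20 threads, 1000 instructions per thread.
    for name, sms, clock in (("GTX 960", 8, 1.29e9), ("GTX 1080 Ti", 28, 1.68e9)):
        cycles, seconds = ideal_kernel_time(2**20, 1000, sms, 4, clock)
        print(f"{name}: {cycles:.0f} cycles, {seconds * 1e3:.3f} ms")
```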

Robert_Crovella · February 8, 2018, 4:37pm

you seem to have a theme running through related to static assignment of warps to schedulers, including statements like a max of 16 and “cannot steal”

I don’t know where any of that is documented. (Please point it out if it is documented somewhere, I may have missed it.) AFAIK the warp schedulers draw from a pool of available/ready warps, and select one or more instructions, per cycle, from each warp, per scheduler.

Also, you indicated 1 instruction from 1 warp per cycle, but most warp schedulers are dual-issue capable, as long as they can find 2 independent instructions from the same warp/instruction stream, in a given cycle.

BulatZiganshin · February 8, 2018, 6:59pm

txbob, for you it may be easier to ask inside nvidia

it’s not clearly documented. in the Fermi whitepaper, the register file (page 8) was pictured as a monolith, but page 10 shows that the left scheduler executes only even warps, while the right scheduler executes only odd warps: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

The Kepler whitepaper also pictured the register file as a monolith: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

Fortunately, hardware.fr publications contained a more exact picture of each SM, in particular for Kepler: http://www.hardware.fr/medias/photos_news/00/44/IMG0044011_1.jpg

Finally, starting with Maxwell, NVidia fixed their picture and started to show the register file as individual per scheduler. See page 8 in the document below:

http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

And an individual register file per scheduler means that a warp cannot be quickly moved to another scheduler - that would require copying all of its register contents. The alternative is to make all registers available to all schedulers, but that would require 4x more read/write ports (a very precious resource) in the register file, and in that case it would be more logical to keep showing the register file as shared by all the schedulers.
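A sketch of what static assignment could look like. The even/odd split for two schedulers matches the Fermi whitepaper picture mentioned above. Extending it to "warp ID modulo scheduler count" on four-scheduler SMs is an assumption for illustration only, not documented behaviour, and the function name is hypothetical.

```python
def assign_warps(num_warps, num_schedulers):
    """Statically assign warp IDs to schedulers by warp_id % num_schedulers.

    Assumption: the modulo rule generalises the even/odd Fermi split.
    Once assigned, a warp never moves to another scheduler.
    """
    queues = [[] for _ in range(num_schedulers)]
    for warp_id in range(num_warps):
        queues[warp_id % num_schedulers].append(warp_id)
    return queues

if __name__ == "__main__":
    print(assign_warps(6, 2))                         # Fermi-style even/odd split
    print([len(q) for q in assign_warps(64, 4)])      # 64 resident warps, 4 schedulers
```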

Robert_Crovella · February 8, 2018, 7:15pm

I agree that the individual register files on the maxwell whitepapers suggest that warps are assigned and probably do not move. I don’t have 100% confirmation of that.

The hardware.fr picture appears to show operand collectors, however, which may have some role in register data exchange.

Anyway, my purpose on these forums is not to take material non-published information about our GPU hardware and make it public. I would get fired for doing that, quick.

So if I can’t ascertain something from previously disclosed public information, it is rare that I would expose any non-public information here.

And in many cases I simply don’t know the answer and it serves little purpose for me to go and discover the answer within NVIDIA. Slideware has its limits, and non-published specifications are subject to change from chip to chip, from OS to OS, and from CUDA version to CUDA version, and in some cases from driver version to driver version.

Anyway I appreciate the collection of references, BulatZiganshin. I generally would agree with your conclusions, but that can’t be construed as any sort of statement of truth or official statement from NVIDIA. It may be the case that warps are assigned “statically”. It may also be the case that the specifics here vary from chip to chip.

BulatZiganshin · February 8, 2018, 7:16pm

each warp scheduler contains multiple engines:

int/fp ALU
SFU (best throughput is 4 cycles/operation)
ld/st unit (best throughput is 4 cycles/operation)
branch unit

Warp scheduler can execute two operations in the same cycle if

they are independent
they are processed by different engines
the corresponding engine has finished issuing its previous operation. F.e. you can issue a new ld/st instruction only once per 4 cycles
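The three conditions can be sketched as a small check. All names here are hypothetical. The 4-cycle issue intervals for SFU and ld/st come from the list above, while the 1-cycle interval for the ALU and branch engines is an assumption. Registers are modelled as plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Minimum cycles between two issues to the same engine.
# "sfu" and "ldst" follow the post; "alu" and "branch" are assumed to be 1.
ISSUE_INTERVAL = {"alu": 1, "sfu": 4, "ldst": 4, "branch": 1}

@dataclass
class Instr:
    engine: str                 # "alu", "sfu", "ldst" or "branch"
    dst: Optional[str]          # destination register, None if nothing is written
    srcs: Tuple[str, ...] = ()  # source registers

def independent(a, b):
    """True if b neither reads nor overwrites a's result, nor clobbers a's inputs."""
    if a.dst is not None and (a.dst in b.srcs or a.dst == b.dst):
        return False
    if b.dst is not None and b.dst in a.srcs:
        return False
    return True

def can_dual_issue(a, b, cycle, last_issue):
    """Check the three conditions; last_issue maps engine -> cycle of its last issue."""
    if not independent(a, b):
        return False
    if a.engine == b.engine:
        return False
    for ins in (a, b):
        last = last_issue.get(ins.engine)
        if last is not None and cycle - last < ISSUE_INTERVAL[ins.engine]:
            return False
    return True

if __name__ == "__main__":
    fma = Instr("alu", "r0", ("r1", "r2", "r3"))
    load = Instr("ldst", "r4", ("r5",))
    print(can_dual_issue(fma, load, cycle=10, last_issue={"ldst": 8}))
    print(can_dual_issue(fma, load, cycle=10, last_issue={"ldst": 6}))
```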

Overall, it’s easier to just consider all but the main ALU operations as free, but take into account their limited throughput. If your program provides enough parallelism, it’s the compiler’s duty to rearrange instructions as much as possible to keep the engines busy.

It’s how things work on Maxwell/Pascal. Volta is even easier - it’s single-issue, but there are more throughput limits. Kepler is more complicated - each pair of schedulers shared an extra ALU engine, so in each cycle one of them can dual-issue two independent ALU instructions (while the paired scheduler can issue one ALU plus one non-ALU instruction). The pair also shared ld/st and SFU engines as you can see here:

A few extra detailed pictures:

And one of the articles exposing them (just scroll down a bit):

a.p.sexton · February 9, 2018, 12:14pm

In the latest Cuda C programming guide v9.1.85, appendix “H.6.1. Architecture” it says:
“A multiprocessor statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.”

On the previous compute capability sections (including that for the 1080Ti and 960), it just says:
“When a multiprocessor is given warps to execute, it first distributes them among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.”
This I took to be indicative of static assignment, but not definitive.

Regarding dual-issue: interesting. I remember reading about that somewhere else, but the above quotes from the C Programming Guide are pretty explicit that “at every instruction issue time, each scheduler issues one instruction for one of its assigned warps”. Is this an error in the documentation? Or does that mean that most warp schedulers have 2 instruction issue times per cycle?

Robert_Crovella · February 9, 2018, 2:52pm

I said “most”

Volta (sm_70) went back to single-issue, I believe (and also doubled the number of warp schedulers per SM, compared to a sm_60 SM). NVIDIA talks about the reasons for this in such presentations as GTC 2017 Inside Volta (you may have to listen to the recording).

Fermi 2.0 was not dual issue either, although 2.1 was dual-issue capable. Kepler, Maxwell, and Pascal should all be dual-issue capable, I believe.

The Kepler description in the programming guide certainly indicates this:
Programming Guide :: CUDA Toolkit Documentation

Here’s a comment from Greg at NV indicating Kepler and Maxwell are dual-issue capable:

Understanding CUDA scheduling - CUDA Programming and Performance - NVIDIA Developer Forums

I admit that seems to contradict the wording in the cc 5.0 programming guide description.

I don’t have a crisp explanation for every reference you have found, but thanks for pointing those out.

I certainly would like to retract my statement about warp assignment. I agree that in Volta the indications are that it is static, with no migration. At some point I think this must have changed, but I’m not really sure. Maybe it has been static assignment all the way back to Fermi.

BulatZiganshin · February 9, 2018, 7:08pm

“at every instruction issue time, each scheduler issues one instruction for one of its assigned warps”.

this formulation may have remained from the CC 1.0 times :) GeForces made an interesting tour from Tesla (CC 1.0) devices to Volta:

CC 1.x - single-issue schedulers, 2 cycles/instruction throughput. I.e. each scheduler (they were called SM in these GPUs) had 8 main ALUs plus 8 SFU ALUs, but they worked at 2x frequency. So, each scheduler cycle, they processed 16 items. The scheduler can start one ALU instruction, at the next cycle an SFU instruction and at the next cycle an ALU instruction again. The CC 1.0 SFU was pretty unusual, being capable of FMUL, so the entire scheduler can perform 3 FP operations per 2 cycles (FMA+FMUL).

CC 2.0 - single-issue schedulers, 1 cycle/instruction throughput. SM had two schedulers, which shared LD/ST engine with 1 cycle/instruction throughput. Each scheduler had 16 ALUs working at 2x frequency, so it can start ALU instruction each cycle, or LD/ST instruction, but two schedulers can’t start LD/ST instructions simultaneously.

CC 2.1 - double-issue schedulers, everything as 2.0, plus extra ALU engine shared by two schedulers in the SM. So, each scheduler can issue ALU+ALU or ALU+LD/ST pair on each cycle, but together, they had only 3 ALUs and 1 LD/ST engine which limited coissue possibilities.

CC 3.x - same as 2.1, but one SM now contained the equivalent of two CC 2.1 SMs. It was the worst of NVidia’s architectures, since there was only 48 KB of shared memory for all 4 schedulers in one SM, which limited both capacity and performance of shared memory access. Also, coissuing to two ALUs was seriously limited by the small number of R/W ports in the register file. It seems that Fermi was less limited because its ALUs were working at 2x frequency, and thus required 2x fewer ports on the register file.

OTOH, in 2010 NVidia was seriously behind AMD in terms of ALUs per GPU (512 in top Fermi vs 1600 in top AMD), and Kepler allowed them to significantly improve that by sacrificing everything else (1536 ALUs in GeForce 680 vs 2048 in Radeon 7970). In the next gen NVidia fixed its mistake by reducing the amount of ILP:

CC 5.x, 6.x: double-issue schedulers, each having its own set of 32 ALUs, 32 branch engines, 8 LD/ST engines, 8 SFU engines. As you can see, it’s pretty close to the CC 2.0 architecture. No more ALU+ALU coissue, which requires A LOT of register ports (each ALU operation can use up to 3R+1W ports, f.e. for an FMA operation, while branch instructions don’t access the main register file at all and other operations have much smaller throughput).

CC 7.0: single-issue schedulers, 2 cycles/instruction throughput. 16 FP32 ALUs, 16 INT ALUs, 16 (?) branch engines, 8 LD/ST engines, 4 SFU engines. But wait, it’s damn close to CC 1.0! The main difference is that the ALUs no longer work at 2x frequency, so the register file needs 2x more ports.

So, in the last 10 years, NVidia went from the high-end HPC-oriented Tesla architecture with a lot of resources per ALU, to the low-end Kepler, and almost returned back with Volta. Their attempt to sell HPC GPUs to gamers was a pain, and AMD cleverly used this situation in 2008-2010, but NVidia’s early push of GPGPU now pays back with the popularity of CUDA vs OpenCL.

The simplest way to track all these changes is through pictures of the various SM generations, especially the improved pictures provided by hardware.fr - once you learn how to interpret them, you will find in them most of the info I provided here, and a few bits on top of that (in particular, shared/L1$/Texture$ memory size and access).

njuffa · February 9, 2018, 7:37pm

Allow me to be a bit skeptical. Do we know the source for the additional information incorporated into hardware.fr’s improved diagrams? Did they design dozens of clever microbenchmarks to suss out the details, i.e. use reverse engineering?

BulatZiganshin · February 10, 2018, 3:02am

CUDA C++ Programming Guide descriptions are hard to interpret otherwise than “scheduling was always static”:

3.x: When a multiprocessor is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any.

5.x: When a multiprocessor is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

6.x: When a multiprocessor is given warps to execute, it first distributes them among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

7.0: A multiprocessor statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

I see that hardware.fr corrected some inaccuracies of the original NVidia pictures so they better match what I know from other sources. They also contain other changes that I can’t confirm, but they look reasonable (f.e. split Tex$). So, I prefer them over the original NVidia pictures.

Note that this particular question (whether scheduling is static or dynamic) doesn’t seem secret, just poorly-documented, so it should be easy to clarify by reaching someone inside NVidia.

njuffa · February 10, 2018, 3:51am

For clarification of existing documentation, you would want to file an enhancement request with NVIDIA (via the bug reporting form linked from the registered developer website; prefix the synopsis with “RFE:”).

NVIDIA has a long history of keeping the details of their hardware architecture secret, or publicly revealing only an absolute minimum of information. The description of the hardware instruction set (SASS) would be a good example of that.

txbob already explained above that “clarify by reaching someone inside NVidia” is not the way this kind of question can get resolved in reality.