Please help me understand this GK210 spec

According to this white paper:

http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

The GK210 has twice as many 32-bit registers per SM as the GK110 (131072 vs 65536), but both architectures have exactly the same maximum number of registers per thread block and the same maximum number of thread blocks per SM.

Can you help me understand this?

Thanks!

Mike

Does it really say “number of registers per thread block”, or does it say “maximum number of registers per thread”?

There is a maximum number of thread blocks the hardware scheduler in each SM can track, and there is a maximum number of threads per thread block that the warp scheduler can track. In addition, there is a maximum number of registers that the hardware’s allocator can assign to each thread. If you multiply these three numbers, you will find that the product far exceeds the total number of registers implemented in each SM. So not all theoretical combinations of “thread blocks per SM”, “threads per thread block”, and “register count per thread” are possible, due to resource constraints.

By doubling the number of registers available per SM, GK210 offers more flexibility, by supporting more combinations of “thread blocks per SM”, “threads per thread block”, and “register count per thread” than were supported on older architectures.
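To put rough numbers on this, here is a small Python sketch using the limits quoted in this thread (16 thread blocks per SM, 1024 threads per thread block, 255 registers per thread) and the per-SM register counts from the white paper. The variable names are just for illustration, and the sketch deliberately ignores every other limit, such as the cap on resident threads per SM.

```python
# Upper limits quoted in this thread (identical on GK110 and GK210).
MAX_BLOCKS_PER_SM = 16
MAX_THREADS_PER_BLOCK = 1024
MAX_REGS_PER_THREAD = 255

# 32-bit registers actually implemented per SM, per the white paper.
REGS_PER_SM = {"GK110": 65536, "GK210": 131072}

# Registers that would be needed if all three limits were hit at once.
worst_case = MAX_BLOCKS_PER_SM * MAX_THREADS_PER_BLOCK * MAX_REGS_PER_THREAD

for gpu, regs in REGS_PER_SM.items():
    ratio = worst_case / regs
    print(f"{gpu}: worst case needs {worst_case} registers, "
          f"about {ratio:.0f}x the {regs} available")
```

The worst case works out to 4,177,920 registers, roughly 64 times what GK110 implements per SM and roughly 32 times what GK210 implements, so on both parts the register file is what rules out most of the theoretical combinations.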

Thanks for the reply njuffa. I am looking at the table on page 7 of the referenced white paper. The exact terms I am looking at are labeled “32-bit Registers / Multiprocessor”, “Max Registers / Thread Block”, and “Max Thread Blocks / Multiprocessor”.

At a certain level I understand your explanation, I think, but I am still struggling to understand how kernel performance would be improved. The GK110 and GK210 have exactly the same upper limits for “thread blocks per SM” (16), “threads per thread block” (1024), and “registers per thread” (255). If for example I have a kernel limited by, say, “registers per thread”, it seems to me that there would be no improvement in occupancy moving from a GK110 to a GK210. Is this correct?

Again, thanks so much for your explanation. I am sure there must be something obvious I am missing. :)

Mike

Suppose I have a total of 100 registers in the SM.
My limits are 100 registers per SM, and 100 registers/threadblock.

Suppose I compile a kernel that runs 10 threads per block, and each thread uses 10 registers, so each threadblock needs 10 x 10 = 100 registers.

Exactly 1 of these threadblocks can be resident on the SM at a time. Occupancy = 1 threadblock.

Now suppose I have 200 registers in the SM.
My limits are 200 registers per SM, but still 100 registers/threadblock.

Now 2 of these threadblocks can be resident on the SM at the same time. Occupancy = 2 threadblocks.

So the larger register file really can improve occupancy in some scenarios, even though the per-threadblock limit is unchanged.
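The toy numbers above can be checked with a short Python sketch. The function `resident_blocks` is a hypothetical helper, not part of any NVIDIA tool, and it only models the register file, the per-threadblock register limit, and the threadblock-count limit; it ignores register allocation granularity, shared memory, and the limit on resident threads per SM. The last two calls assume a per-threadblock limit of 65536 registers on both GPUs, which is my reading of the white paper's table, and a made-up kernel with 256 threads per block at the 255 registers/thread maximum.

```python
def resident_blocks(regs_per_sm, max_regs_per_block, max_blocks_per_sm,
                    threads_per_block, regs_per_thread):
    """Threadblocks that fit on one SM, considering registers only."""
    regs_per_block = threads_per_block * regs_per_thread
    if regs_per_block > max_regs_per_block:
        return 0  # a single threadblock already exceeds the per-block limit
    # Limited by either the register file or the block-count limit.
    return min(max_blocks_per_sm, regs_per_sm // regs_per_block)

# Toy example from this post: 10 threads/block, 10 registers/thread.
print(resident_blocks(100, 100, 16, 10, 10))   # 1
print(resident_blocks(200, 100, 16, 10, 10))   # 2

# Hypothetical kernel: 256 threads/block at 255 registers/thread.
# Assumed per-block limit: 65536 registers on both GPUs.
print(resident_blocks(65536, 65536, 16, 256, 255))    # GK110: 1
print(resident_blocks(131072, 65536, 16, 256, 255))   # GK210: 2
```

In this simplified model, a kernel that sits at the registers-per-thread limit goes from one resident threadblock on GK110 to two on GK210, because the per-thread and per-threadblock limits did not change but the register file they draw from doubled.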