Unofficial Kepler Slides from Random Gamer Site

Yeah, yeah, but we only have another week to rumor-monger!

For those of you who like to obsess over leaked, unsourced slides from gaming sites, this one has a few pictures of the Kepler architecture:

Caveats include the fact that drawings like this have been misleading in the past (leading to arguments about whether multiprocessors had 8 CUDA cores or not in the compute capability 1.x days).

If you take this slide at face value, then it looks like:

  • Streaming multiprocessors are now very, very large, and have a new name: “SMX”. (Because we can’t change the terminology too often!)
  • Each SMX contains 4 warp schedulers, 8 dispatch units, 192 CUDA cores (!!), 32 special function units and 32 load/store units.
  • The GTX 680 will have 8 SMX * 192 CUDA cores per SMX = 1536 CUDA cores.
  • To support this, the register file for each SMX has been increased to 65,536 32-bit registers.
  • L1/shared memory cache is still only 64 kB.
  • No idea what the L2 cache size is now.

I have to admit that I’m more than a little surprised by the large ratio of CUDA cores to other resources in the new SMX. In comparison, an SM on a GF100/GF110 chip has 32 CUDA cores, 32k registers, and 64 kB of cache/shared memory. If the pipelines are the same depth on GK104, then hiding the pipeline latency would require 22 * 6 = 132 warps, or 4224 threads, running per SMX. That’s a lot of register pressure…
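Back-of-envelope version of that worry (the 22-warp figure is the usual Fermi rule of thumb; everything else is straight off the slide, so treat it as a sketch):

    #include <cstdio>

    int main() {
        // Rule of thumb: ~22 resident warps hide arithmetic latency on a
        // 32-core Fermi SM; scale by the core ratio if GK104 keeps the
        // same pipeline depth.
        const int fermi_warps    = 22;
        const int core_ratio     = 192 / 32;                  // SMX vs. SM cores
        const int warps_needed   = fermi_warps * core_ratio;  // 132
        const int threads_needed = warps_needed * 32;         // 4224
        const int register_file  = 65536;                     // 32-bit registers per SMX

        printf("warps needed:     %d\n", warps_needed);
        printf("threads needed:   %d\n", threads_needed);
        printf("registers/thread: %.1f\n", (double)register_file / threads_needed);
        return 0;
    }

That last division is why the register pressure looks scary.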

(Maybe I’m just thinking about this wrong, or this slide is bogus.)

Indeed. Seems to be optimized for gaming, just as GF104 was. I’d hope that the real CUDA chip (GK100 / GK102 / whatever it will be called) will have a better balance for our purposes.

Some thoughts… on something that could be completely fake :)

  • I saw somewhere (the web source could be wrong) that the transistor count has only increased slightly, from 3 billion to 3.5 billion, so if the new processor really does have 1536 cores then something had to go. I would guess we couldn’t fit three Fermis in 3.5 billion transistors.
  • 3 times the number of cores but the same number of registers (MUCH more register pressure)
  • 3 times the number of cores but only half the total L1/shared memory (GF100 had 16 sets and the new one has 8 sets). That is 1/6th the amount of shared/L1 per core.
  • It looks like some of the room for the shared/L1 has turned into cores. (half the shared/L1 and 3 times the cores)
  • Maybe NVIDIA is moving away from fewer, more powerful cores toward many more lightweight cores?
  • Maybe L2 performance has improved/grown so L1/Shared is not as important?
  • I thought Kepler was just a modified version of Fermi (not a ground-up redesign the way Fermi was), so I am a little bit skeptical of this drastic a change. (But who knows.)

That is a good point. If the goal is to get the 28nm process figured out with a smaller die, then this makes sense. The size of the SMX is the only thing that makes me suspect this diagram is inaccurate somehow. 96 CUDA cores per SMX (each with 64 kB of shared/L1 memory) makes a lot more sense to me than 192. The only compelling reason to slam that many CUDA cores together would be to save transistors somehow.

I don’t see this happening, except maybe as a temporary measure of desperation. Bill Dally (NVIDIA’s Chief Scientist) has given a number of talks in which he repeats his (and presumably, NVIDIA’s) mantra that the future of high-performance computing is about power-efficient parallel architectures, and power efficiency requires data locality. (His talks are really good. Go check them out!)

If anything, I was expecting Kepler to either increase the size of the L1 cache, or possibly add another non-coherent cache between L1 and the old L2 that is shared between groups of multiprocessors. These groups of multiprocessors would correspond to partitions that you could assign to different processes. Then we could finally share one GPU between CUDA and a graphical desktop.

Some thoughts …

16 cores at “double speed” executing one 32-element-wide warp is indistinguishable from 32 cores doing the same at “full speed”, as long as you do not notice any difference in latency - which very conveniently is an unknown at this point. Suppose half the cores are an artist’s rendition of the CUDA mental model; then that would reduce actual register pressure a lot.

Instruction-level parallelism might have increased, similar to what happened in later incarnations of the previous generation (e.g. GF104). You would then have more cores working on the same set of registers, with direct access to each other’s results, which again reduces register pressure.

Shared memory looks like a bummer. I had really hoped for an increase there, or at least for it to keep a constant ratio to the size of the register file.

I agree with you that it looks very problematic to have such a deep pipeline with a much lower ratio of on-chip memory to FPUs. Are you sure about the register and L1/shared sizes? I’ve been looking for that kind of information but have been unable to find even any rumoured numbers…

It certainly looks like you’re going to have to really optimize your kernel to use MUCH fewer registers: 65536 registers / 4224 active threads ≈ 15.5 registers/thread. That’s down from 128 on GT200 and 64 on Fermi…

Another matter is that GK104 has a 256-bit bus @ 6 GHz => 192 GB/s of bandwidth to support 3250 GFLOPS worth of compute power. That’s double the GTX 580’s compute power but at the same bandwidth. Looks to me like we’re going to be bandwidth bound for many applications.
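To put a number on “bandwidth bound” (rumored GK104 specs from above; the GTX 580 figures are from memory, so treat this as a sketch):

    #include <cstdio>

    int main() {
        // FLOPs a kernel must perform per byte of DRAM traffic before it
        // stops being bandwidth bound (higher = harder to keep cores fed).
        const double gk104_gflops  = 3250.0;           // rumored
        const double gk104_gbs     = 256 / 8.0 * 6.0;  // 256-bit bus @ 6 GT/s = 192 GB/s
        const double gtx580_gflops = 1581.0;           // 512 cores * 1544 MHz * 2
        const double gtx580_gbs    = 192.4;            // 384-bit bus @ ~4 GT/s

        printf("GK104:   %.1f FLOPs/byte\n", gk104_gflops / gk104_gbs);    // ~16.9
        printf("GTX 580: %.1f FLOPs/byte\n", gtx580_gflops / gtx580_gbs);  // ~8.2
        return 0;
    }

So the break-even arithmetic intensity roughly doubles.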

This might be a product aimed more at gaming while its rumoured big brother part GK100 ( ? ) coming later this year might be aimed more at compute applications.

Anyways I really hope it’s going to sport more registers and shared memory than that : /

EDIT: The register file size is actually right there in the picture :)

I’m not surprised by the decreased registers/core nor the huge increase in “SIMD” width.

These are consistent with Bill Dally’s presentations.

  1. Huge register files are power-inefficient. To allow smaller register files without starvation, you can decrease the number of concurrent threads by pipelining the CUDA cores to decrease the latency between instructions. Based on the 24-cycle instruction latency, my understanding is that current CUDA cores are completely unpipelined w.r.t. a single thread (though pipelined w.r.t. other threads). So pipelining seems to be low-hanging fruit. I’ve designed a single-threaded processor with pipelining/forwarding and it wasn’t that hard to do. For a core that has to handle 512 threads, I imagine pipelining could be much more difficult due to all the extra baggage?

  2. Increased “SIMD” width. Bill’s analogy of ALUs being like kids and pets makes sense, since you have this huge instruction cache feeding a single tiny ALU, so sharing the instruction cache across more cores will be a win. The main concern will be thread divergence. I speculate NVIDIA will introduce a limited form of MIMD parallelism into the SMX unit (i.e. allow multiple instruction streams either within a warp or across warps). I suppose they could have an instruction cache with multiple read ports so that a few instructions can be fetched each cycle. This would be a saving compared to having multiple separate instruction memories.

A limited form of MIMD would greatly benefit limited divergence codes like graph searches and database operations. No more need to do SIMD compaction in software to get maximum performance.
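Here’s the kind of branch I mean (a toy sketch of my own, nothing from the slides). Under today’s SIMD execution a warp with mixed flags runs both sides of the if/else back to back with lanes masked off; a limited MIMD scheme could issue the two sides concurrently:

    // Hypothetical frontier-expansion step from a graph search: lanes whose
    // flag is set take the "visit" path, the rest take the "skip" path, so
    // a warp with mixed flags diverges at the branch.
    __global__ void visit_frontier(const int* frontier_flags, const int* adj,
                                   int* next_frontier, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (frontier_flags[i]) {
            next_frontier[i] = adj[i];   // active lanes do real work
        } else {
            next_frontier[i] = 0;        // inactive lanes idle under SIMD
        }
    }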

I guess my concern is that the large register files were really important in first generation CUDA because register spills required going to (comparatively) slow local memory. With Fermi, that was mitigated by adding the L1 cache, which speeds up local memory access. In this leaked Kepler design, the thread count to L1 ratio is likely to be much higher, which will push more local memory accesses out to the L2 cache or global memory. I don’t have a good feel for the amount of local memory people need per thread in typical applications, so maybe this isn’t a big deal.

Agreed here on the MIMD. Rigid SIMD makes particle propagation algorithms more complicated as well.

It looks like there are some new reviews out this morning for Kepler and the leaked stuff above was accurate.

Seibert was right about the reduced cache. It looks like there will be a 512 KB L2 cache.

This is one article: http://www.maximumpc.com/article/features/keplar_unveiled_nvidias_gtx_680_benchmarked_-depth

Question: Does it look like we will have only 3 to 4 registers on average per core? 65536 (memory) / (192 cores × ~24 active warps) ≈ 14.2 bytes per core, which would be only ~3.5 registers per core.

BTW - there is another “leaked” (maybe not real) item out there for the GK110 GPU: 6 billion transistors, 2304 cores:
http://www.legitreviews.com/news/12673/

EDIT: Thanks kleboeuf, I thought it was 65,536 bytes and not full 32-bit registers, so there are ~14 (or more) registers per thread.

There’s a white paper out at NVIDIA’s website that talks a little bit more about the SMX architecture. Looks like they’ve traded Fermi’s higher clock rate for more cores with shorter pipelines. Also, if I’m reading this right, the instruction latency is now constant for all instructions (or at least all the math instructions). Maybe instruction latency will go from 18-26 on Fermi down to something like 10 or 12 on Kepler? That would give us 65,536 registers / 192 cores / 12 resident threads per core ≈ 28 registers per thread, which is pretty close to before.
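Same arithmetic for a couple of guessed latencies, nothing official:

    #include <cstdio>

    int main() {
        // Registers available per thread if each core needs 'latency'
        // resident threads to keep its pipeline full (latencies guessed).
        const int regs = 65536, cores = 192;
        const int latencies[] = {12, 22};
        for (int latency : latencies) {
            printf("latency %2d: %4.1f regs/thread\n",
                   latency, (double)regs / cores / latency);
        }
        // latency 12 -> ~28 regs/thread; latency 22 -> ~15.5 (the earlier worry)
        return 0;
    }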

The white paper also makes it sound like with constant instruction latency, a lot of the scheduling can now be offloaded to the compiler/assembler, which makes sense. Anyone else have any thoughts? I can’t wait for some CUDA-related documentation to be released.

Altogether the SMX now has 15 functional units that the warp schedulers can call on. Each of the 4 schedulers in turn can issue up to 2 instructions per clock if there’s ILP to be extracted from their respective warps, allowing the schedulers as a whole to issue instructions to up to 8 of the 15 functional units in any clock cycle.

GK104 SMX Functional Units

  • 32 CUDA cores (#1)
  • 32 CUDA cores (#2)
  • 32 CUDA cores (#3)
  • 32 CUDA cores (#4)
  • 32 CUDA cores (#5)
  • 32 CUDA cores (#6)
  • 16 Load/Store Units (#1)
  • 16 Load/Store Units (#2)
  • 16 Interpolation SFUs (#1)
  • 16 Interpolation SFUs (#2)
  • 16 Special Function SFUs (#1)
  • 16 Special Function SFUs (#2)
  • 8 Texture Units (#1)
  • 8 Texture Units (#2)
  • 8 CUDA FP64 cores

The other change coming from GF114 is the mysterious block #15 … The CUDA FP64 block contains 8 special CUDA cores that are not part of the general CUDA core count and are not in any of NVIDIA’s diagrams. These CUDA cores can only do, and are only used for, FP64 math. What’s more, the CUDA FP64 block has a very special execution rate: 1/1 FP32. With only 8 CUDA cores in this block, it takes NVIDIA 4 cycles to execute a whole 32-thread warp (32 / 8 = 4).

With GK104, NVIDIA is going back to static scheduling … since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instructions inside a warp was redundant, as the compiler already knows the latency of each math instruction it issues. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions within a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling. http://www.anandtech…tx-680-review/2
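For instance, since the compiler knows every math latency, it can statically pair independent operations for the dual-issue schedulers described above. A contrived kernel with that kind of ILP (my own illustration, not from the article):

    // The two multiplies are independent, so the compiler/scheduler can
    // dispatch them to two different 32-core blocks in the same cycle;
    // the final add has to wait for both results.
    __global__ void ilp_example(const float* a, const float* b, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = a[i] * 2.0f;   // independent
        float y = b[i] * 3.0f;   // independent
        out[i] = x + y;          // dependent on x and y
    }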

EDIT: This thing won’t fly without compiler support

hooray, I can talk about GK104 now :)

a CUDA toolkit that supports GK104 will be out in the near future (4.2 RC–it’s 4.1 + some bug fixes + Kepler support).

Awesome!

I think everyone on this thread was concerned with the SMX only doubling the number of registers while increasing the cores by 6x. But after reading the whitepaper I’m hoping we can assume that the core pipelines are shorter (much shorter?) and that we’ll still be able to fully utilize the SMX without reducing our existing practical register-per-thread designs? Also, I assume we can still obtain 63 registers per thread for compute-bound kernels?

Can’t wait to see the new docs… and any new PTX instructions. :)

What CC does Kepler have? 3.0?

Yes, GK104 has CC 3.0.

So the big question, which you guys probably still can’t talk about, is the differences between today’s GK104 and GK100. The GK104 whitepaper, but especially the benchmarks, shows GK104’s excellent graphics performance, but not quite as revolutionary CUDA performance. (Admittedly, none of the review sites go deep into CUDA benchmarks!) GK100 will obviously be bigger, and have all the frills like FP64 and ECC. Will it still use GK104’s static scheduler? Some of Anand’s analysis suggests the dynamic scheduler was better for compute, but not needed as much for graphics.

What is GK100? O:

GK110? :-)

What is GK110?


So… how about upcoming nifty new Teslas and Quadros?
