GTC Keynote Thread
Here are some notes (with my interpretation, since details were a little thin) from the GTC keynote live stream today:

* The Tesla rumors seem to be correct. The first release of Kepler for the Tesla product line (the Tesla K10) looks basically like an enterprise-y GTX 690. The only detail given was that single precision is 3x faster than Fermi, which sounds about right for two GK104 chips.

* The next Kepler chip will be released in the Tesla K20 and have good (though not quantified) double precision performance and two interesting new features: Hyper-Q and dynamic parallelism. Edit: The Tesla K20 will be out in Q4 2012.

* Hyper-Q: The GPU supports multiple "work queues", which appears to be what is required to support multiple concurrent CUDA contexts on one device. Kepler (not GK104, I assume) will support 32 concurrent work queues, which I assume will translate to 32 concurrent processes. I think this will finally put the watchdog to bed. :) (Edit: This might not be the right interpretation. See my post below.)

* "Dynamic Parallelism": This is their term for the ability for kernels to launch other kernels. I will be curious to see how this is exposed to the software developer, but this is definitely a frequently requested feature.


There were some other interesting, but less CUDA-related, things:

* Full GPU virtualization to allow virtual machines to share a single GPU.

* Hardware support for streaming the GPU framebuffer to a remote device.

I missed the first 30 minutes, so those of you who were there should chime in with other details I missed.


Edit: Of course, the real question is whether the GPU in the K20 will show up in a GeForce. No information was given on this (obviously), but I suspect the time frame for the K20 means we aren't going to see any GeForce improvements until close to the end of the year. Hoard your GTX 580s! :)

#1
Posted 05/15/2012 07:17 PM   
I am curious: what compute capability will GK100 have?

#2
Posted 05/15/2012 07:31 PM   
If you mean the new GPU in the future Tesla K20, then I don't think it will be compute capability 3.0.

Hint: If you have a registered developer account (on the old or new site), you should log in and check out the CUDA 5.0 toolkit that has been posted. There is some new information in the CUDA Programming Guide about this future architecture.
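
For anyone who wants to check what their current card reports, here is a quick sketch of mine (nothing K20-specific, just the standard runtime API) that queries the compute capability of each device:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor encode the compute capability,
        // e.g. 2.0 for Fermi GF100 or 3.0 for Kepler GK104.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
[/code]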

#3
Posted 05/16/2012 01:13 AM   
Thanks Seibert for this labor of love for those who didn't watch the stream!

It appears [url="http://translate.google.com/translate?u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FGTC-2012-GK110-Grafikchip-hat-bis-zu-2880-Shader-Kerne-1576464.html&act=url"]there will be GK110 based consumer cards next year[/url]. Yay!

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
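
To illustrate the first point, a minimal error-checking macro of the kind meant here (my own sketch, adapt as you like):

[code]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported immediately.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Typical usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   myKernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());      // catches launch failures
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches errors from the kernel itself
[/code]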

#4
Posted 05/16/2012 01:37 AM   
[quote name='seibert' date='15 May 2012 - 05:13 PM' timestamp='1337130828' post='1408915']
If you mean the new GPU in the future Tesla K20, then I don't think it will be compute capability 3.0.

Hint: If you have a registered developer account (on the old or new site), you should log in and check out the CUDA 5.0 toolkit that has been posted. There is some new information in the CUDA Programming Guide about this future architecture.
[/quote]
sm_35

#5
Posted 05/16/2012 02:05 AM   
Hi,

There was a talk at 4pm, "CUDA 5 and Beyond", where the new features of Kepler GK104 and GK110 were presented. Unfortunately, I missed the first 10 minutes of it (how can one be so stupid as to miss something like this?). Nonetheless, what I saw was thrilling:
[list]
[*]Dynamic parallelism: that's the big thing! You'll be able to launch kernels from kernels, and have them behave the way you would expect. For mesh refinement, that's a killer feature!
[*]GPUDirect taken to the next level: it becomes what you always expected it to be (and what you might have thought it already was), i.e. proper P2P GPU memory transfers through RDMA, even across clusters.
[*]Hyper-Q: well, maybe it was covered during the time I missed... but that's just as thrilling, especially for moderately parallel algorithms, where sharing the GPU between many processes is possible. I have many workloads of that sort ready to exploit this feature.
[/list]
As unimpressed as I was by the gaming and video rendering part of the keynote, I find the HPC part and the CUDA roadmap exciting.
I can't wait to see the sessions on "Inside Kepler" and the "New Features in the CUDA Programming Model".

#6
Posted 05/16/2012 03:09 AM   
Well, the good news here is they are not going to screw the 690 owners with the Tesla K10. People can either buy their 690s or wait for the K20. :thumbup:

#7
Posted 05/16/2012 09:17 AM   
[quote name='tera' date='16 May 2012 - 02:37 AM' timestamp='1337132257' post='1408925']
Thanks Seibert for this labor of love for those who didn't watch the stream!

It appears [url="http://translate.google.com/translate?u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FGTC-2012-GK110-Grafikchip-hat-bis-zu-2880-Shader-Kerne-1576464.html&act=url"]there will be GK110 based consumer cards next year[/url]. Yay!
[/quote]

That's really good news.

#8
Posted 05/16/2012 11:53 AM   
Reading through the Kepler Tuning Guide in the CUDA 5.0 documentation, I think I might have misinterpreted the presentation describing Hyper-Q. It sounds like what Hyper-Q fixes is a more subtle problem with multiple CUDA streams in a single process blocking each other in Fermi. The fundamental problem with Fermi (apparently) is that the driver has to serialize work from many queues into a single hardware queue on the device. This limits the power of multiple streams because you are locked into your queue ordering too soon, leading to suboptimal utilization for various combinations of queued work. Hyper-Q exposes multiple hardware queues to the driver, so that CUDA streams in software can map to hardware work queues and defer the scheduling decision as late as possible.

However, nowhere in this documentation does it say anything about multiple processes using the same GPU at the same time. That might still be possible, but it isn't being advertised.
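
To make that concrete, the pattern that should benefit is the usual one of issuing independent kernels on separate streams. A minimal sketch (my own, using nothing beyond the existing stream API):

[code]
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main() {
    const int nStreams = 8, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_data[s], n * sizeof(float));
    }

    // On Fermi, these launches are funneled into a single hardware queue, so
    // the scheduler can falsely serialize independent streams. With Hyper-Q,
    // each stream should map to its own hardware queue and overlap freely.
    for (int s = 0; s < nStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
[/code]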

#9
Posted 05/16/2012 01:27 PM   
In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.
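
Assuming device-side launches use the same <<<>>> syntax as the host API (my guess; I haven't seen anything official yet), the traversal might look something like this sketch:

[code]
struct Node {
    Node *children[4];
    int nChildren;
    // ... per-node data ...
};

// Hypothetical recursive traversal using device-side launches. Presumably
// this needs the new architecture and relocatable device code, something
// like: nvcc -arch=sm_35 -rdc=true tree.cu -lcudadevrt
__global__ void traverse(Node *node) {
    // One block per node; thread 0 launches the children.
    if (threadIdx.x == 0) {
        for (int c = 0; c < node->nChildren; ++c)
            traverse<<<1, 32>>>(node->children[c]);  // device-side launch
    }
    // ... process this node in parallel with the rest of the block ...
}
[/code]

The open questions are how cheap each device-side launch is and whether the hardware limits the recursion depth.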

#10
Posted 05/16/2012 02:39 PM   
[quote name='seibert' date='16 May 2012 - 05:27 AM' timestamp='1337174839' post='1409122']
Reading through the Kepler Tuning Guide in the CUDA 5.0 documentation, I think I might have misinterpreted the presentation describing Hyper-Q. It sounds like what Hyper-Q fixes is a more subtle problem with multiple CUDA streams in a single process blocking each other in Fermi. The fundamental problem with Fermi (apparently) is that the driver has to serialize work from many queues into a single hardware queue on the device. This limits the power of multiple streams because you are locked into your queue ordering too soon, leading to suboptimal utilization for various combinations of queued work. Hyper-Q exposes multiple hardware queues to the driver, so that CUDA streams in software can map to hardware work queues and defer the scheduling decision as late as possible.

However, no where in this documentation does it say anything about multiple processes using the same GPU at the same time. That might still be possible, but it isn't being advertised.
[/quote]
look at nvidia-cuda-proxy-control on Linux

#11
Posted 05/16/2012 03:09 PM   
[quote name='tmurray' date='16 May 2012 - 10:09 AM' timestamp='1337180953' post='1409145']
look at nvidia-cuda-proxy-control on Linux
[/quote]
To expand on Tim's answer slightly, there is a technology called Proxy in CUDA 5. As I understand it, it is designed for use in MPI programs and it creates one CUDA context for all MPI ranks that share the same GPU.

Sarah Tariq briefly talked about this in her talk "S0351 - Strong Scaling for Molecular Dynamics Applications" (for those of you who want to pull up the video when it's posted). She presented some benchmarks showing that using Proxy improved NAMD performance significantly.
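
For context, the setup this targets is the usual one where a node has more MPI ranks than GPUs, so several ranks land on the same device. A rough sketch of that situation (mine, not from her talk):

[code]
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    // With more ranks than GPUs, several ranks share a device. Without Proxy,
    // each rank creates its own CUDA context and the contexts are time-sliced;
    // with Proxy, their work is funneled through one shared context.
    cudaSetDevice(rank % nDevices);

    // ... per-rank CUDA work goes here ...

    MPI_Finalize();
    return 0;
}
[/code]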

#12
Posted 05/16/2012 03:36 PM   
[quote name='seibert' date='16 May 2012 - 05:39 PM' timestamp='1337179149' post='1409138']
In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.
[/quote]
Does anyone remember when nVidia started talking about this feature? A date, or any reference on the web?

thanks
[quote name='seibert' date='16 May 2012 - 05:39 PM' timestamp='1337179149' post='1409138']

In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.



Does anyone remember when did nVidia start talking about this feature? date? any reference on the web for that?



thanks

#13
Posted 05/16/2012 04:17 PM   
[quote name='eyalhir74' date='16 May 2012 - 08:17 AM' timestamp='1337185064' post='1409178']
Does anyone remember when did nVidia start talking about this feature? date? any reference on the web for that?

thanks
[/quote]
yesterday at the keynote :)

#14
Posted 05/16/2012 05:20 PM   
[quote name='tmurray' date='16 May 2012 - 08:20 PM' timestamp='1337188842' post='1409206']
yesterday at the keynote :)
[/quote]
Thanks Tim. I was actually looking for a possible reference from Jen-Hsun at one of the previous GTCs (back in 2008 or 2009)...
How long has nVidia been working on this, and was this kind of thing (if not the actual name) hinted at by nVidia in the past?

thanks

#15
Posted 05/16/2012 05:27 PM   