GTC Keynote Thread
Here are some notes (with my interpretation, since details were a little thin) from the GTC keynote live stream today:

* The Tesla rumors seem to be correct. The first release of Kepler for the Tesla product line (the Tesla K10) looks basically like an enterprise-y GTX 690. The only detail given was that single precision is 3x faster than Fermi, which sounds about right for two GK104 chips.

* The next Kepler chip will be released in the Tesla K20 and have good (though not quantified) double precision performance and two interesting new features: Hyper-Q and dynamic parallelism. Edit: The Tesla K20 will be out in Q4 2012.

* Hyper-Q: The GPU supports multiple "work queues", which appears to be what is required to support multiple concurrent CUDA contexts on one device. Kepler (not GK104, I assume) will support 32 concurrent work queues, which I assume will translate to 32 concurrent processes. I think this will finally put the watchdog to bed. :) (Edit: This might not be the right interpretation. See my post below.)

* "Dynamic Parallelism": This is their term for the ability for kernels to launch other kernels. I will be curious to see how this is exposed to the software developer, but this is definitely a frequently requested feature.


There were some other interesting, but less CUDA-related, things:

* Full GPU virtualization to allow virtual machines to share a single GPU.

* Hardware support for streaming the GPU framebuffer to a remote device.

I missed the first 30 minutes, so those of you who were there should chime in with other details I missed.


Edit: Of course, the real question is whether the GPU in the K20 will show up in a GeForce. No information was given on this (obviously), but I suspect the time frame for the K20 means we aren't going to see any GeForce improvements until close to the end of the year. Hoard your GTX 580s! :)

#1
Posted 05/15/2012 07:17 PM   
I am curious: what compute capability will GK100 have?

#2
Posted 05/15/2012 07:31 PM   
If you mean the new GPU in the future Tesla K20, then I don't think it will be compute capability 3.0.

Hint: If you have a registered developer account (on the old or new site), you should log in and check out the CUDA 5.0 toolkit that has been posted. There is some new information in the CUDA Programming Guide about this future architecture.
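
For anyone who wants to check what their current card reports, here is a quick sketch of mine (nothing K20-specific, just the standard runtime API) that queries the compute capability of each device:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor encode the compute capability,
        // e.g. 2.0 for Fermi GF100 or 3.0 for Kepler GK104.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
[/code]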

#3
Posted 05/16/2012 01:13 AM   
Thanks Seibert for this labor of love for those who didn't watch the stream!

It appears [url="http://translate.google.com/translate?u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FGTC-2012-GK110-Grafikchip-hat-bis-zu-2880-Shader-Kerne-1576464.html&act=url"]there will be GK110 based consumer cards next year[/url]. Yay!

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
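
To illustrate the first point, a minimal error-checking macro of the kind meant here (my own sketch, adapt as you like):

[code]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported immediately.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Typical usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   myKernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());      // catches launch failures
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches errors from the kernel itself
[/code]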

#4
Posted 05/16/2012 01:37 AM   
[quote name='seibert' date='15 May 2012 - 05:13 PM' timestamp='1337130828' post='1408915']
If you mean the new GPU in the future Tesla K20, then I don't think it will be compute capability 3.0.

Hint: If you have a registered developer account (on the old or new site), you should log in and check out the CUDA 5.0 toolkit that has been posted. There is some new information in the CUDA Programming Guide about this future architecture.
[/quote]
sm_35

#5
Posted 05/16/2012 02:05 AM   
Hi,

There was a talk at 4pm, "CUDA 5 and Beyond", where the new features of Kepler GK104 and GK110 were presented. Unfortunately, I missed the first 10 minutes of it (how can one be so stupid as to miss something like this?). Nonetheless, what I saw was thrilling:
[list]
[*]Dynamic parallelism: that's the big thing! You'll be able to launch kernels from kernels, and have them behave the way you would expect. For mesh refinement, that's a killer feature!
[*]GPUDirect taken to the next level: it becomes what you always expected it to be (and what you might have thought it already was), i.e. proper P2P GPU memory transfers through RDMA, even across clusters.
[*]Hyper-Q: well, maybe it was covered during the time I missed... but that's just as thrilling, especially for moderately parallel algorithms, where sharing the GPU between many processes is possible. I have many workloads of that sort ready to exploit this feature.
[/list]
As unimpressed as I was by the gaming and video rendering part of the keynote, I find the HPC part and the CUDA roadmap exciting.
I can't wait to see the sessions on "Inside Kepler" and the "New Features in the CUDA Programming Model".

#6
Posted 05/16/2012 03:09 AM   
Well, the good news here is they are not going to screw the 690 owners with the Tesla K10. People can either buy their 690s or wait for the K20. :thumbup:

#7
Posted 05/16/2012 09:17 AM   
[quote name='tera' date='16 May 2012 - 02:37 AM' timestamp='1337132257' post='1408925']
Thanks Seibert for this labor of love for those who didn't watch the stream!

It appears [url="http://translate.google.com/translate?u=http%3A%2F%2Fwww.heise.de%2Fnewsticker%2Fmeldung%2FGTC-2012-GK110-Grafikchip-hat-bis-zu-2880-Shader-Kerne-1576464.html&act=url"]there will be GK110 based consumer cards next year[/url]. Yay!
[/quote]

That's really good news.

#8
Posted 05/16/2012 11:53 AM   
Reading through the Kepler Tuning Guide in the CUDA 5.0 documentation, I think I might have misinterpreted the presentation describing Hyper-Q. It sounds like what Hyper-Q fixes is a more subtle problem with multiple CUDA streams in a single process blocking each other in Fermi. The fundamental problem with Fermi (apparently) is that the driver has to serialize work from many queues into a single hardware queue on the device. This limits the power of multiple streams because you are locked into your queue ordering too soon, leading to suboptimal utilization for various combinations of queued work. Hyper-Q exposes multiple hardware queues to the driver, so that CUDA streams in software can map to hardware work queues and defer the scheduling decision as late as possible.

However, nowhere in this documentation does it say anything about multiple processes using the same GPU at the same time. That might still be possible, but it isn't being advertised.
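
To make that concrete, the pattern that should benefit is the usual one of issuing independent kernels on separate streams. A minimal sketch (my own, using nothing beyond the existing stream API):

[code]
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main() {
    const int nStreams = 8, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_data[s], n * sizeof(float));
    }

    // On Fermi, these launches are funneled into a single hardware queue, so
    // the scheduler can falsely serialize independent streams. With Hyper-Q,
    // each stream should map to its own hardware queue and overlap freely.
    for (int s = 0; s < nStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
[/code]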

#9
Posted 05/16/2012 01:27 PM   
In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.
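
Assuming device-side launches use the same <<<>>> syntax as the host API (my guess; I haven't seen anything official yet), the traversal might look something like this sketch:

[code]
struct Node {
    Node *children[4];
    int nChildren;
    // ... per-node data ...
};

// Hypothetical recursive traversal using device-side launches. Presumably
// this needs the new architecture and relocatable device code, something
// like: nvcc -arch=sm_35 -rdc=true tree.cu -lcudadevrt
__global__ void traverse(Node *node) {
    // One block per node; thread 0 launches the children.
    if (threadIdx.x == 0) {
        for (int c = 0; c < node->nChildren; ++c)
            traverse<<<1, 32>>>(node->children[c]);  // device-side launch
    }
    // ... process this node in parallel with the rest of the block ...
}
[/code]

The open questions are how cheap each device-side launch is and whether the hardware limits the recursion depth.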

#10
Posted 05/16/2012 02:39 PM   
[quote name='seibert' date='16 May 2012 - 05:27 AM' timestamp='1337174839' post='1409122']
Reading through the Kepler Tuning Guide in the CUDA 5.0 documentation, I think I might have misinterpreted the presentation describing Hyper-Q. It sounds like what Hyper-Q fixes is a more subtle problem with multiple CUDA streams in a single process blocking each other in Fermi. The fundamental problem with Fermi (apparently) is that the driver has to serialize work from many queues into a single hardware queue on the device. This limits the power of multiple streams because you are locked into your queue ordering too soon, leading to suboptimal utilization for various combinations of queued work. Hyper-Q exposes multiple hardware queues to the driver, so that CUDA streams in software can map to hardware work queues and defer the scheduling decision as late as possible.

However, no where in this documentation does it say anything about multiple processes using the same GPU at the same time. That might still be possible, but it isn't being advertised.
[/quote]
look at nvidia-cuda-proxy-control on Linux

#11
Posted 05/16/2012 03:09 PM   
[quote name='tmurray' date='16 May 2012 - 10:09 AM' timestamp='1337180953' post='1409145']
look at nvidia-cuda-proxy-control on Linux
[/quote]
To expand on Tim's answer slightly, there is a technology called Proxy in CUDA 5. As I understand it, it is designed for use in MPI programs and it creates one CUDA context for all MPI ranks that share the same GPU.

Sarah Tariq briefly talked about this in her talk "S0351 - Strong Scaling for Molecular Dynamics Applications" (for those of you who want to pull up the video when it's posted). She presented some benchmarks showing that using Proxy improved NAMD performance significantly.
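
For context, the setup this targets is the usual one where a node has more MPI ranks than GPUs, so several ranks land on the same device. A rough sketch of that situation (mine, not from her talk):

[code]
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);

    // With more ranks than GPUs, several ranks share a device. Without Proxy,
    // each rank creates its own CUDA context and the contexts are time-sliced;
    // with Proxy, their work is funneled through one shared context.
    cudaSetDevice(rank % nDevices);

    // ... per-rank CUDA work goes here ...

    MPI_Finalize();
    return 0;
}
[/code]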

#12
Posted 05/16/2012 03:36 PM   
[quote name='seibert' date='16 May 2012 - 05:39 PM' timestamp='1337179149' post='1409138']
In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.
[/quote]
Does anyone remember when nVidia started talking about this feature? A date, or any reference on the web?

thanks
[quote name='seibert' date='16 May 2012 - 05:39 PM' timestamp='1337179149' post='1409138']

In thinking about applications for dynamic parallelism, I'm wondering how efficient the launch mechanism is. I could imagine a tree-traversal method where you launch a single block kernel from the CPU, and that block in turn launches several more single block kernels, and so on recursively.



Does anyone remember when did nVidia start talking about this feature? date? any reference on the web for that?



thanks

#13
Posted 05/16/2012 04:17 PM   
[quote name='eyalhir74' date='16 May 2012 - 08:17 AM' timestamp='1337185064' post='1409178']
Does anyone remember when did nVidia start talking about this feature? date? any reference on the web for that?

thanks
[/quote]
yesterday at the keynote :)

#14
Posted 05/16/2012 05:20 PM   
[quote name='tmurray' date='16 May 2012 - 08:20 PM' timestamp='1337188842' post='1409206']
yesterday at the keynote :)
[/quote]
Thanks Tim. I was actually looking for a possible reference from Jen-Hsun at one of the previous GTCs (back in 2008 or 2009)...
How long has nVidia been working on this, and was this kind of thing (if not the actual name) hinted at by nVidia in the past?

thanks

#15
Posted 05/16/2012 05:27 PM   