192 CUDA cores - how are they organized: 6x32 or 4x32 + 4x16?
Interesting reading: "[url="http://techreport.com/articles.x/22653"]Nvidia's GeForce GTX 680 graphics processor[/url]" by Scott Wasson. A citation:

[quote]The organization of the SMX's execution units isn't truly apparent in the diagram above. Although Nvidia likes to talk about them as individual "cores," the ALUs are actually grouped into execution units of varying widths. In the SMX, there are four 16-ALU-wide vector execution units and four 32-wide units. Each of the four schedulers in the diagram above is associated with one vec16 unit and one vec32 unit.[/quote]
That's curious, as the best multiply-add throughput I can get on a GTX 680 corresponds to 5.5 x 32 ALUs per SM, not the 6 x 32 that the sheer core count suggests. Also, the best throughput in integer operations corresponds to 32 cores per SM.
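For reference, this kind of number typically comes from a microbenchmark along the following lines. This is only a sketch of the general approach, not the actual benchmark used above; the kernel name, constants, and launch configuration are all illustrative.

```cuda
// Rough FMA-throughput microbenchmark (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_chain(float *out, float a, float b, int iters)
{
    // Two independent chains per thread to expose some ILP; a single
    // dependent chain would understate peak throughput.
    float x = threadIdx.x * 0.001f;
    float y = x + 1.0f;
    for (int i = 0; i < iters; ++i) {
        x = __fmaf_rn(x, a, b);
        y = __fmaf_rn(y, a, b);
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = x + y;  // keep result live
}

int main()
{
    const int blocks = 8, threads = 256, iters = 1 << 20;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    fma_chain<<<blocks, threads>>>(d_out, 1.000001f, 1e-7f, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    // 2 FMAs per iteration per thread, each FMA counted as 2 flops.
    double flops = 2.0 * 2.0 * (double)iters * blocks * threads;
    printf("%.1f GFLOP/s\n", flops / (ms * 1e6));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d_out);
    return 0;
}
```

Dividing the measured rate by the shader clock and SM count gives the effective ALUs-per-SM figure quoted above.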

Comments?

#1
Posted 04/28/2012 10:40 PM   
[quote name='vvolkov' date='28 April 2012 - 03:40 PM' timestamp='1335652858' post='1402072']
Also, the best throughput in integer operations corresponds to 32 cores per SM.
[/quote]

That integer measurement sounds like a match. The throughput table in the updated programming guide (section 5.4.1) says 32 ops/clock/SM for shift/mul/mad/sad.

#2
Posted 04/29/2012 12:17 AM   
Random idea: have you tried using a different multiple of warps (60 vs. 64, 30 vs. 32)? I'm wondering if the new Kepler scheduler could benefit from a simpler mapping between warps and the 6x32 (4x48?) vector units? Just a thought...

Also, if the compiler is determining instruction scheduling and, implicitly, warp scheduling up front, then perhaps handing it explicit kernel launch bounds (section B.19 of the Guide) might help?

No idea. I will get a 680 right after the GTC. :)

#3
Posted 04/29/2012 12:30 AM   
[quote name='allanmac' date='28 April 2012 - 05:30 PM' timestamp='1335659450' post='1402097']
Random idea: have you tried using a different multiple of warps (60 vs. 64, 30 vs. 32)? I'm wondering if the new Kepler scheduler could benefit from a simpler mapping between warps and the 6x32 (4x48?) vector units? Just a thought...
[/quote]
If I run only 1 thread block, then the best performance is at multiples of 4 warps. Sounds reasonable given 4 warp schedulers per SM.

But I can't get good performance unless running at least 2 warps per scheduler! With 1 warp the throughput is almost exactly 2x lower. I have seen a similar story on the G80 - it couldn't run fast with 1 warp per SM, or even 1 warp per block. (But 2 warps were just fine.)
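The warp-count effect described above can be probed with a single-block sweep, roughly like this (a sketch; the kernel and constants are my own illustration, and a dependent chain is used deliberately so that throughput depends on how many warps the schedulers can overlap):

```cuda
// Single-SM sweep over warps-per-block (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mad_chain(float *out, float a, float b, int iters)
{
    // One dependent multiply-add chain per thread.
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;
    out[threadIdx.x] = x;  // keep the chain from being optimized away
}

int main()
{
    const int iters = 1 << 18;
    float *d_out;
    cudaMalloc(&d_out, 16 * 32 * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // One block only, so everything runs on a single SM.
    for (int warps = 1; warps <= 16; ++warps) {
        cudaEventRecord(t0);
        mad_chain<<<1, warps * 32>>>(d_out, 1.000001f, 1e-7f, iters);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%2d warps: %.3f ms\n", warps, ms);
    }

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d_out);
    return 0;
}
```

If the 2-warps-per-scheduler story holds, the per-warp time should roughly halve going from 4 to 8 warps and then flatten out.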

[quote name='allanmac' date='28 April 2012 - 05:30 PM' timestamp='1335659450' post='1402097']
Also, if compiler is determining instruction scheduling and, implicitly, warp scheduling up front then perhaps handing the compiler explicit kernel launch bounds (section B.19 of Guide) might help?
[/quote]
B.19 doesn't talk about scheduling, only about register usage. Still, might be worth a try...

[quote name='allanmac' date='28 April 2012 - 05:30 PM' timestamp='1335659450' post='1402097']
No idea. I will get a 680 right after the GTC. :)
[/quote]
Don't you want to wait for the widely rumored "big Kepler"? ;)

[quote name='allanmac' date='28 April 2012 - 05:17 PM' timestamp='1335658654' post='1402095']
That integer measurement sounds like a match. The throughput table in the updated programming guide (section 5.4.1) says 32 ops/clock/SM for shift/mul/mad/sad.
[/quote]
Thanks for this pointer! Now I am totally confused. Throughput of 32-bit integer adds is 168 per SM, logical operations are at 136 per SM - these are not even multiples of 16! How does it all work?!?

#4
Posted 04/29/2012 03:05 AM   
There is a 4/5/2012 and a 4/16/2012 Programming Guide (on the [url="http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf"]developer site[/url]). The Win64 and OSX 4.2 installers have the 4/5/2012 copy -- at least on my machines.

The simplest reading of the table in the latest guide seems to imply that out of the 192 cores there are 32 that can perform int32 shift/mul/mad/sad each clock and the remaining 160 can do int32 comparisons and logic (assuming all the operations are actually single cycle). Hopefully someone from NVIDIA can provide details at the GTC session on Kepler. :)
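One way to probe that split is to time an integer multiply-add chain against a logical-op chain on a single SM. This is only a sketch (kernel names, constants, and launch configuration are illustrative), and whether the compiler really emits IMAD for the first kernel should be verified with cuobjdump.

```cuda
// Compare int32 mul-add vs. logical-op throughput on one SM (sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void int_mad(int *out, int a, int b, int iters)
{
    int x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;            // expected to map to IMAD (check SASS)
    out[threadIdx.x] = x;
}

__global__ void int_logic(int *out, int a, int b, int iters)
{
    int x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = (x ^ a) | b;          // two logical ops per iteration
    out[threadIdx.x] = x;
}

// Launch a kernel through a function pointer and return elapsed ms.
static float time_ms(void (*k)(int *, int, int, int), int *d, int iters)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    k<<<1, 512>>>(d, 3, 7, iters);   // one block = one SM
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main()
{
    int *d;
    cudaMalloc(&d, 512 * sizeof(int));
    const int iters = 1 << 20;
    // Note: the logic kernel does 2 ops/iteration vs. 1 for the mad
    // kernel, so halve its per-op time before comparing rates.
    printf("int mad:   %.3f ms\n", time_ms(int_mad,   d, iters));
    printf("int logic: %.3f ms\n", time_ms(int_logic, d, iters));
    cudaFree(d);
    return 0;
}
```

A 32-vs-160 core split would show up as roughly a 5x gap in per-op rates between the two kernels.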

The aluminum/magnesium/polycarbonate GTX 690 was announced tonight in Shanghai. I'm not sure I could bear putting that card on the *inside* of my case given how cool it looks. If the "Big Kepler" is going to top the GTX 690 it better come shrouded in gold!

#5
Posted 04/29/2012 05:16 AM   
(table attached)
Attachments

throughput.png

#6
Posted 04/29/2012 05:20 AM   