What's new in Maxwell 'sm_52' (GTX 9xx) ?

  • The first CUDA difference noted on the NVIDIA blog is that shared memory has been bumped up to 96 KB. That's 2x Kepler and 50% more than Maxwell v1.

    That's a welcome change since some people had kernels tuned for a shared-to-register ratio of 1.5 -- i.e. the Fermi ratio which allowed about 96 bytes per thread in a full-sized 63 register x 512 thread block.

    With Kepler/Maxwell-v1/Maxwell-v2 having 64K 32-bit registers, Maxwell-v2 returns to that ratio and there are once again 24 32-bit words of shared mem per 64 register x 1024 thread block.

  • The Maxwell Tuning Guide and the CUDA C Programming Guide note that similar to GK110B, GM204 can "opt-in to caching of global loads in its unified L1/Texture cache."

  • There appears to be support for FP16 vector atomics operating on global memory. Expose this in CUDA, please!

  • The GTX 980 is reported as having two asynchronous copy engines.

  • There is also a new CUDA Toolkit with sm_52 support.

  • New drivers: 343/344.xx. FYI, these drivers no longer support sm_1x devices. I had to remove a GT 240 (x1) this morning in order to boot Win7/x64.

  • Boost clocks on the 980 look to be as high as we've seen on the 750 Ti. Some of the "golden" GTX 750 Ti's boosted to 1320 MHz out of the box. Amazingly, there is an EVGA 980 listed with a guaranteed boost of 1342 MHz (!). And @cbuchner1's crypto link shows overclocks reaching 1520 MHz (!).

Anything else?

#1
Posted 09/19/2014 03:30 AM   
Cool, I'll be picking one up as soon as I can find one. I don't see one listed on Newegg yet. I have a ton of in-depth detail on Maxwell that I still need to write up. I've been too busy tweaking further performance and features out of my assembler. My sgemm implementation now runs at over 98% efficiency, 3.7% over cublas. That's pretty much right at the synthetic level, minus the small overhead of things like the bar.syncs you need for real code. Anyway, with this new hardware that should translate into close to 200 Gflops over cublas, or about 5.3 Tflops total.

#2
Posted 09/19/2014 04:17 AM   
Wow! You should take some power measurements too, if possible. It would be cool to see how hard your SGEMM is pushing the PCB. Either a Kill-a-Watt or the TDP sensor output in something like GPU-Z (it shows a percent of TDP on the 750 Ti).

#3
Posted 09/19/2014 04:26 AM   
Can someone please post the entire deviceQuery output for GM204?

Also, amazing architecture: second-generation Maxwell practically demolishes Big Kepler GK110 in compute while using a lot less power, and it's not even Big Maxwell GM200 yet.

#4
Posted 09/19/2014 08:28 AM   
It is still using a 256-bit bus, which I am not loving. It has less bandwidth than a 780 Ti by a fair margin (224 GB/s vs. 336 GB/s). Even the 780 (not the Ti) ran a 384-bit bus for 288 GB/s of bandwidth.

Hopefully, since this is GM204, we will see a GM200 with a wider 384-bit memory bus in the first Tesla product.

#5
Posted 09/19/2014 12:20 PM   
I am still sitting on 3 GTX 780 Tis from the crypto mining craze of last year. IMHO this upgraded Maxwell architecture does not really have any killer features that make me want to switch hardware right now. They also dropped some instructions from the CUDA cores to save die space and power -- the video instructions in particular. So I guess I'll pass. These 780 Tis are going to serve me well enough.

For those into crypto mining (still), this might be of interest:

GTX 980 crypto mining performance:
http://cryptomining-blog.com/3503-crypto-mining-performance-of-the-new-nvidia-geforce-gtx-980/

The table with the raw performance figures:
http://cryptomining-blog.com/wp-content/uploads/2014/09/gtx-980-cryptomining-hashrate.jpg

#6
Posted 09/19/2014 12:26 PM   
I am slightly curious as to why I cannot find any review from my usual suspects that talks about compute performance. This seems odd, as they usually have at least one mention of GPGPU. Has anyone else managed to find a compute-based review (aside from the crypto stuff above)? The cynic in me wonders whether the reviewers have been 'advised' not to include compute figures and to focus on the (admittedly very impressive) gaming performance.

#7
Posted 09/19/2014 12:59 PM   
I've seen Luxmark 2.0 figures in many benchmarks. I believe this measures OpenCL performance.

#8
Posted 09/19/2014 01:22 PM   
Anandtech has some compute numbers.
http://anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20

#9
Posted 09/19/2014 01:26 PM   
Thanks, that definitely is a bit surprising in a good way. It actually looks like there is an improvement in compute. I will still be wary of the memory bandwidth until I get my grubby hands on one for a good thrashing.

#10
Posted 09/19/2014 02:08 PM   
NewEgg does list the GTX 980 and GTX 970 now, but 'sold out' with an ETA of 9/23.

http://www.newegg.com/Product/Product.aspx?Item=N82E16814487067&cm_re=gtx_980-_-14-487-067-_-Product

Hopefully there will be a GTX 980 Ti released.

Regardless, I am going to get one and run my atypical compute tests against the GTX 780 Ti. NVIDIA says they got 2.7 Tflops for nbody, while I was only able to get 2.1 Tflops with the GTX 780 Ti.

#11
Posted 09/19/2014 07:01 PM   
allanmac: thanks for the power measurement tip. I tried GPU-z a while back and it was broken with Maxwell and I forgot all about it. Got the new version and it works fine. Looking at the clocks and TDP values during computation cleared up a few things for me. My fastest implementation runs at 1658 Gflops sustained and it's able to do that at a 1320 clock. TDP hovers between 98 and 99%.

However, using different instruction ordering patterns and different register reuse and bank access patterns was giving me mysterious results. But looking at the clock and TDP I can now be more sure of why. Less register reuse increases register bank bandwidth and drops the clock down to 1306, and the one with the different ordering but same amount of reuse kept the clock at 1320 but TDP dropped down to 96%. This means it's stalling somewhere. I'm now pretty certain it's the register bank conflicts between ffmas and ongoing memory ops. Memory ops hold on to their register values for at least 20 clocks (which is why you need write-after-read barriers for memory operands). So during that time it makes sense you could get additional bank conflicts. I'm not sure which gets prioritized in the event of a conflict but either way could slow things down.

Also, I looked at my opcode flags for the ATOM op. It's clear there are holes for future expansion, so I gave them a try with the new cuobjdump and found the F16x2 flag at least:

ATOM: type
0x0002000000000000 .S32
0x0004000000000000 .U64
0x0006000000000000 .F32.FTZ.RN
0x0008000000000000 .F16x2.FTZ.RN
0x000a000000000000 .S64
0x0002000000000000 .64

You would think the F16x4 value would be using the "c" or "e" flag ("1" is used for the 64-bit addressing 'E' flag). I also tried the "a" flag, since S64 is supposed to be considered illegal with ATOM.ADD (you can see the "2" flag is overloaded depending on the mode: CAS uses .64). But cuobjdump had no issue with it... maybe that support has been added as well. So maxas supports F16x2 at least now (it's checked in if you want to play with it).

New 980 arrives Monday. Eager to put it through its paces.

#12
Posted 09/20/2014 10:40 PM   
I wonder why the clock is dropping? Is the 750 Ti overheating?

Oh, the clock is probably dropping _because_ you're at 99% TDP. None of my benchmarks have managed to get beyond 70% TDP yet report 99% GPU and MEM so that's quite an accomplishment to max out the TDP with a CUDA app. :)

If it's actually heat and not wattage, then you could try installing something like EVGA Precision X and maxing out your fan RPMs.

If you haven't already, dumping all your metrics with "nvprof.exe -m all <sgemm.exe>" might reveal some more interesting stuff. It might take a while to capture all the metrics.

That's cool that FP16x2 atomics are visible. Now I just wish that FP16 vector FMAs existed in the SMM (fma.sat.v2.f16).

I feel sorry for your GTX 980. It probably thinks it's going to a quiet PC and will only play a few hours of video games each week.

#13
Posted 09/21/2014 12:24 AM   
I had Precision X installed but didn't notice you could control the fan speed. I really hate "enthusiast" UIs. But upping the fan got me to 1660 Gflops sustained. The slower configs ran a touch faster, but I don't think they're temperature bound. I've included comments in the code on this issue here:

https://code.google.com/p/maxas/source/browse/sgemm/sgemm128.sass
and this might help too, though it's a work in progress (the texture load mapping is already outdated):
https://code.google.com/p/maxas/wiki/sgemm

Forgot about nvprof but it turns out not to give you much data beyond what you get from Nsight, which I've been leveraging heavily. My IPC issued/executed per SM is at 4.26 (out of a theoretical 4.29 with the level of dual issues in my code). Warp issue efficiency is at 99% and 15 out of 16 warps per SM are eligible on average.

F16x2 FMAs would be cool to have. In fact, one of the driving factors behind wanting to implement my own sgemm was wanting to leverage the normalized-float functionality of the texture loads. That way I can store my weight matrices with 16- or even 8-bit precision if I want. The other was needing to implement custom convolution kernels. The cuDNN lib that NVIDIA just released is cool and all, but it's still using MAGMA-style sgemm :/

OK, next up: I think it's a couple of final features for the assembler, then I'll document it all (I promise). I want fully automatic bank-conflict-avoiding register allocation for the non-fixed registers (I'm most of the way there on this). And I want a simple built-in compiler for C-like expressions. With C syntax you can write all your tedious memory offset code normally and have all the assembly handled for you. Then you can focus your assembly purely on the performance sections of the kernel. So with those two features combined, writing a kernel should be as painless as working in CUDA C for the mundane stuff, but you get complete register control and the full power of SASS for your performance code.

I used to be one of those gamers and I doubt I've clocked enough gpu compute hours to remotely approach the number of graphics ops executed over past years... but I'm working on it :)

#14
Posted 09/21/2014 03:10 AM   
Very nice! I had missed that Maxwell v2 has 96 KB of shared memory.

I would be really interested to hear about peak CUFFT performance for FP32; has anyone performed any such benchies? Last time I looked, it appeared to be a bandwidth-bound problem. Perhaps the new, larger L2 will help out massively for smaller FFT sizes?

@scottgray: 98% utilization on SGEMM is very impressive! What utilization did you get on Kepler?

#15
Posted 09/22/2014 01:57 PM   