One weird trick to get a Maxwell v2 GPU to reach its max memory clock!

I’m not sure if this has been covered already but if you are wondering why your CUDA kernels don’t seem to be prodding your Maxwell v2 GPU to its max rated memory clock speed then read this thread.

Kudos to the people on that thread for recognizing that compute applications weren’t achieving the same memory clocks as graphics applications.

In my case, I have an EVGA GTX 980 SC ACX 2.0 that immediately boosts to a GPU/MEM clock of 1392/1502 MHz.

However, the card is rated for a max MEM clock of 1752 MHz yet I had never seen a CUDA kernel boost beyond 1502 MHz.

After reading the above thread, I queried the supported clocks:

nvidia-smi -i <device id> -q -d SUPPORTED_CLOCKS | more

… and set the application clocks to the max supported for this card:

nvidia-smi -i <device id> -ac 3505,1531
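As a sanity check, the first command below dumps the current and applications clocks so you can confirm the new setting took effect, and the second reverts to the default application clocks; both flags are documented in nvidia-smi --help, but treat this as a sketch:

nvidia-smi -i <device id> -q -d CLOCK

nvidia-smi -i <device id> -rac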

The results are impressive!

The CUDA Samples “Bandwidth Test” now reports almost 200 GB/s instead of the previous ~160 GB/s.
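If you want to reproduce just that number, the figure that tracks the memory clock is the device-to-device bandwidth; if I recall the sample's switches correctly, something along the lines of

bandwidthTest --device=<device id> --dtod

isolates it, but double-check against the sample's --help output since I'm quoting the flags from memory.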

My HotSort benchmark leapt as well! The purple line shows the impact of the improved mem clock boost.

I wonder why compute kernels default to a lower power state?

This is on Win7/x64 + 358.87.

Thanks for highlighting the use of application clocks! I have repeatedly recommended them to forum participants over the past couple of years, but a worked example accompanied by a nice graph is worth a thousand words on the topic :-)

Thanks @njuffa.

The surprise for CUDA devs with GM20X GPUs is that, according to nvidia-smi, the power state for Type “C” (compute) tasks doesn’t default to the maximum supported MEM clock. Type “C+G” (compute+graphics) tasks operate as expected. The previously mentioned thread really digs into it.
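If you want to watch this on your own card, something like the query below (field names taken from nvidia-smi --help-query-gpu) prints the performance state plus the current and application clocks once per second while a kernel runs; consider it a sketch:

nvidia-smi -i <device id> -l 1 --format=csv --query-gpu=pstate,clocks.sm,clocks.mem,clocks.applications.graphics,clocks.applications.memory

On my card, a Type “C” process sits in a reduced performance state with the memory clock pinned below the rated maximum until the application clocks are raised.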

It’s always pleasing to get a free and safe performance boost!

Now I can’t wait to hear from CudaaduC whether the use of application clocks enables him to boost his app over the performance threshold he has been approaching (from below) for a while.

I agree the observed auto-boosting behavior doesn’t make intuitive sense, maybe someone from NVIDIA can enlighten us on this issue.

Ha, I had the same thought and am also waiting for @CudaaduC to drop some TITAN X benchmarks!

Since I have a reference Titan X in the same PC as an EVGA superclocked Titan X, I just upped the reference Titan X to:

nvidia-smi -i 0 -ac 3300,1304

and that already pushed my multi-GPU 512^3 RabbitCT back-projection time down to my target time of 740 ms (including all memory copies in both directions).

For reference, 256^3 was at 220 ms and 1024^3 at ~5 seconds, including memory copy times (without this new boost). I will benchmark the new iterations over the next few days.

All this without using texture bilinear interpolation, which all the top finishers use.

Forgive the “dumb” question, but is this technically “overclocking” or is it just using more of the intended capacity of the GPU?

Shouldn’t a reference TITAN X be showing a maximum supported MEM clock of 3505 since it’s rated at 7.0 GHz?

I don’t think setting the GPU clock above the TITAN X’s max GPU boost (>=1075 MHz?) will have any impact.

You should be able to observe your GPU/MEM clocks with an app like GPU-Z.

For example, when I set my GPU to “3505,1531” I’m not seeing a GPU clock boost beyond the previously observed 1392 MHz.

But the default MEM clock for CUDA kernels was far below the model’s rated speed and did increase to 3505 (7.0 Gbps effective).

Shorter answer: I don’t think this can be considered overclocking.

Maybe NVIDIA has a reason for clamping the MEM-clock when running CUDA/CL apps?

Anyway, here’s the dump from my GTX 980:

$> nvidia-smi -i 2 -q -d SUPPORTED_CLOCKS | more

==============NVSMI LOG==============

Timestamp                           : Wed Nov 04 15:03:40 2015
Driver Version                      : 358.87

Attached GPUs                       : 3
GPU 0000:05:00.0
    Supported Clocks
        Memory                      : 3505 MHz  <-- this mfgr. rated MEM boost was _never_ being
            Graphics                : 1531 MHz      reached by CUDA benches until nvidia-smi cmd
            Graphics                : 1519 MHz
                           ⁞
            Graphics                : 1405 MHz
            Graphics                : 1392 MHz  <-- default GPU boost
            Graphics                : 1380 MHz
            Graphics                : 1367 MHz  <-- mfgr. rated GPU boost
            Graphics                : 1354 MHz
                           ⁞

Good question. The application clocks set up by NVIDIA for a particular reference device are obviously sanctioned by them and these GPUs should function flawlessly at the highest application clock settable through nvidia-smi. What NVIDIA does not guarantee when choosing high application clocks is that clock throttling due to exceeding the power or thermal limits for the card will not occur.

From what I have seen, many applications run just fine at the highest application clocks while staying well clear of the throttling limits. That is why I always recommend that people try those application clocks with their apps.

As far as I understand, the GPU default clocks are chosen such that clock throttling should never occur under normal operating conditions, no matter what the application. This is important where hundreds or thousands of GPUs must be run continuously, all at the same performance, as part of a cluster.

What I do not know is whether third-party devices allow the setting of application clocks that exceed the application clocks settable on NVIDIA’s reference devices. Personally, I have always been wary of “super-clocked” devices that run at higher than NVIDIA’s reference frequencies out of the box. My (potentially unjustified) concern is that the third-party vendors use graphics rather than compute applications to qualify the parts at those increased clocks.

So as long as you run your GPUs at the NVIDIA-approved application clocks, you are not overclocking, as the parts are designed and qualified by NVIDIA for those clocks, meaning you are not eating into the design margin of the part. As I explained in a post a while back, the design margin is there to absorb manufacturing variations as well as aging effects that slow down transistors and wires over time.

I tried Jimmy Petterson’s sum reduction bandwidth test from a few years back after this trick, and this is the new output:

GeForce GTX TITAN X @ 336.480 GB/s

 N                [GB/s]    [perc]   [usec]     test
 1048576          185.32    55.08    22.6       Pass
 2097152          228.87    68.02    36.7       Pass
 4194304          257.38    76.49    65.2       Pass
 8388608          279.34    83.02    120.1      Pass
 16777216         291.66    86.68    230.1      Pass
 33554432         298.23    88.63    450.1      Pass
 67108864         301.58    89.63    890.1      Pass
 134217728        303.35    90.15    1769.8     Pass

 Non-base 2 tests!

 N                [GB/s]    [perc]   [usec]     test
 14680102         289.22    85.95    203.0      Pass
 14680119         289.19    85.95    203.0      Pass
 18875600         285.96    84.99    264.0      Pass
 7434886          172.57    51.29    172.3      Pass
 13324075         260.93    77.55    204.3      Pass
 15764213         272.75    81.06    231.2      Pass
 1850154          68.17     20.26    108.6      Pass
 4991241          155.73    46.28    128.2      Pass

303.35 GB/s is the highest I have ever seen for a sum reduction!

It would perhaps be relevant to hear about GFLOPS/watt or GB/s / watt before and after? :-)

Sure, but I’m not sure how to calculate the GB/s per watt.

Without that nvidia-smi adjustment to the maximum supported memory clock, this is the output for the same GTX Titan X:

GeForce GTX TITAN X @ 336.480 GB/s

 N                [GB/s]    [perc]   [usec]     test
 1048576          178.02    52.91    23.6       Pass
 2097152          218.36    64.90    38.4       Pass
 4194304          244.16    72.56    68.7       Pass
 8388608          263.64    78.35    127.3      Pass
 16777216         274.35    81.53    244.6      Pass
 33554432         280.17    83.26    479.1      Pass
 67108864         283.10    84.14    948.2      Pass
 134217728        284.67    84.60    1886.0     Pass

 Non-base 2 tests!

 N                [GB/s]    [perc]   [usec]     test
 14680102         272.11    80.87    215.8      Pass
 14680119         272.17    80.89    215.7      Pass
 18875600         269.12    79.98    280.6      Pass
 7434886          165.63    49.22    179.6      Pass
 13324075         246.52    73.27    216.2      Pass
 15764213         257.22    76.44    245.1      Pass
 1850154          66.55     19.78    111.2      Pass
 4991241          149.80    44.52    133.3      Pass

So, basically, a bit more than a 5% boost from setting the memory clock to the max supported 3505 MHz, per my earlier post.

You can try this:

nvidia-smi -i <device> --loop-ms=333 --format=csv,noheader --query-gpu=power.draw

which outputs a new measurement every 333 ms:

13.06 W
13.06 W
55.48 W
130.90 W
133.95 W
134.06 W
133.47 W
133.57 W
134.24 W
134.63 W
134.44 W
57.89 W
53.60 W

The documentation for this property is:

  • "power.draw" The last measured power draw for the entire board, in watts. Only available if power management is supported. This reading is accurate to within +/- 5 watts.

@allanmac: Do I understand correctly that you have seen a TITAN X operate at graphics clocks of 1392 MHz in a compute-only application, with only the application clock tweaked using nvidia-smi? That would be surprising, as I’ve never seen our TITAN X go above 1202 MHz (sometimes, depending on its mood, it gets stuck at 1177 or 1189) even though I have the application clocks set to 1391,3505.

Also note that without overriding the default fan speed limit of 60% of the max RPM, all non-Tesla cards (including Quadro) will easily end up throttling. AFAIK there is still no other solution for headless servers but running a dummy X server and fixing the fan speed with nvidia-settings. Does anybody have a different experience?

@pszilard, I have an EVGA GTX 980 card that boosts to 1392 MHz (which is 25 MHz higher than EVGA guaranteed).

All my stats and comments were in reference to this specific SKU: EVGA GTX 980 SC ACX 2.0.

I think this entire thread can be summarized for CUDA devs as:

  1. If you're a gamer then do nothing.
  2. If you don't have a Maxwell2 (GM20x) card then do nothing.
  3. If you're not benchmarking and/or don't care about an extra 16% bandwidth then do nothing.
  4. Otherwise, using the nvidia-smi utility, dump the supported clocks.
  5. Identify the highest supported memory and graphics clocks.
  6. Use the nvidia-smi utility to set the application clocks.

Dump & Identify:

nvidia-smi -i <device id> -q -d SUPPORTED_CLOCKS | more

Set (as Administrator):

nvidia-smi -i <device id> -ac <max memory clock>,<max graphics clock>

AFAIK, setting the Graphics clock to its highest supported value will have no impact since it seems that CUDA apps are already being granted a card-specific maximum clock speed.

Graphics clock speeds are not an issue with Maxwell2 and CUDA – just Memory clock speeds.

It remains unexplained why GM20x Memory clocks are not boosting to the card’s maximum supported Memory clock without this incantation.

I think most of us are assuming that this behavior is either an oversight or by design… and that bumping the card to its manufacturer-specific peak performance for CUDA apps is not a hazardous or unsupported operation.

FWIW, my GTX 980 runs at these supported clocks for hours at a time and, so far, there are no errors or magic smoke! :)

Do you mean that when you run some compute-intensive code the reported sustained SM clock is 1392 MHz?

I’m asking because this is contrary to our experience with GTX TITAN X and 980 Ti cards (and a similar effect has been observed with Kepler TITANs too): the SM clock never goes above 1163-1202 MHz, even when the application clocks are set to 1391 MHz (AFAIR it’s the same on our vanilla TITAN and 980 Ti cards). The stable clock is not entirely consistent either, but varies from run to run by 3-5% (between 1163-1189 MHz, rarely 1202). This is of course without any power/thermal violations.

I’m surprised that you do not mention overheating as an issue with non-Tesla cards. In my experience none of the GeForce (nor the Quadro) cards can sustain their peak boost clock under heavy load (while staying below the power limit) due to the fan behavior. AFAIK this is because the fan is “optimized for acoustics”: fan speed ramps up quite slowly and, in my experience, it does not go above 60% rotation as reported by nvidia-smi without manually overriding the fan speed through the graphical tool (nvidia-settings or its command-line interface). However, the override requires an X server to be running, which is not ideal for a headless server. Of course, the NVRAM can be flashed too, but that’s not something I’d encourage.

Additionally, the aforementioned slight variations in the clock boosting can be problematic for benchmarking, especially with short kernels. Moreover, in my experience, while “Auto boost” does lead to the SM clock boosting higher than the application clocks that are set (e.g. the defaults), the jump from the set application clock value to the max can take seconds, which can also influence benchmark measurements.

All my observations are based on tests using well-cooled (mostly rack-mounted) headless Linux servers running 352.x drivers. Could it be that the Windows driver behaves so differently? I was under the impression that the fan behavior is programmed in the NVRAM.

Interesting. We have run multi-GPU simulations using both the EVGA SC 980 ACX and the EVGA SC Titan X for weeks at a time without issue (Windows 7 x64 OS). We do not overclock beyond the factory settings, so maybe that makes the difference.

When I monitored the temp of the GTX 980 GPUs they never got above 73 degrees Celsius. Have not done the same test using the Titan X GPUs.

Have you had a failure with a Geforce card without overclocking beyond the factory settings?

Yes, sustained.

Plenty of GTX 980 models are sold with extremely high clocks. As I noted above, this particular SKU has a guaranteed boost of 1367 MHz.

The much more powerful 980 Ti and TITAN X cards aren’t clocked as high. Googling shows the highest air-cooled TITAN X you can buy from EVGA boosts to 1216 MHz.

So it sounds to me like your cards are working fine. :)

We’re not using factory-overclocked cards nor do we overclock them. I’ve had no failures, but these cards all end up throttling due to overheating. Here’s some data I just gathered to illustrate the issue: https://goo.gl/mW4O4S. I was running the

nbody -benchmark -numbodies=256000

in an infinite loop on a reference TITAN X and did measurements using this command:

nvidia-smi dmon -d 2 -s pucvm

Unfortunately nvidia-smi dmon does not monitor the fan (note to self: should file an RFI), but during the first phase of the plot, until about 670 s, the fan ran as designed with default settings; then I did the fan speed overriding trick (setting it to 90% of the max RPM) and, as expected, the temperature went down and the frequency immediately went back up. As the nbody SDK sample code pushes the card to the max TDP, the stable clock was around 1113 MHz, which is a >10% higher clock with the fan issue fixed.
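For anyone wanting to reproduce the override: the line below is roughly what I mean, assuming a recent 35x driver (where nvidia-settings exposes the GPUFanControlState and GPUTargetFanSpeed attributes) and Coolbits enabled in xorg.conf (Option "Coolbits" "4"); adjust the GPU/fan indices to your setup and treat it as a sketch, not a recipe:

nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=90"

Setting GPUFanControlState back to 0 returns the fan to automatic control.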

I’ve just checked with our vanilla GTX 980 too and observed the same: it starts out at ~1252 MHz and, as soon as it hits 80 C, it gradually drops down to 1126 MHz.

Could somebody do a similar test on Windows and report back whether they see the same behavior?

Yes, indeed, for some definition of fine. :)

For the above (and previously outlined) reasons, I feel like it’s not the safest to suggest to CUDA devs that bumping the card to its manufacturer-specific peak clocks is not a hazardous or unsupported operation.

I should’ve added a caveat: lots of airflow helps!

NBody is a great application to use as a burn-in test since it’s so intense.

nvidia-smi reports that the GTX 980 is pulling a peak 176W out of its 185W cap.

That’s as high as I’ve ever seen for a CUDA app.

After 15 minutes, the GTX 980 in my case is running nbody at a steady 70-71C and a fan speed of 25%.

The clocks remain rock solid at 1392/1752 GPU/MEM.

Why? Probably because my case has a 200mm fan right in front of the GPUs.

[I am a slow typist and wrote the following without the benefit of seeing allanmac’s post above]

That is a pretty comprehensive set of data; it will be interesting to see comparison data from CudaaduC.

What was the approximate ambient air temperature when these measurements were taken? What kind of a case (enclosure) is being used to house the system components? For example, sticking an actively cooled GPU into a 1U pizza box server enclosure would be a BAD IDEA™, air-flow wise. Is there any possibility that air flow to the GPU could be obstructed by other components (including other GPUs) or cabling, or that the ingress air stream for the GPU is pre-heated by other components in the case, such as the CPU and/or the power supply?

For PSUs, I would recommend 80+ Platinum models these days; their high efficiency minimizes waste heat. CPUs require additional DC-DC conversion (as do GPUs). Unfortunately it seems to be really difficult to assess the efficiency of that conversion, as it seems close to impossible to track down that information. I see that some motherboard manufacturers claim use of significantly more efficient designs. In the past, DC-DC conversion modules for CPUs were usually around 80% efficient; from what I gather from recent specification sheets, up to 93% efficiency is technically feasible now (but may be pricey). The power consumption of CPUs themselves is well documented of course; most of that heats the air in the case and may lead to higher GPU air intake temperatures.
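To put rough numbers on that last point: feeding a hypothetical 145 W CPU through an 80%-efficient converter draws about 181 W and dissipates roughly 36 W in the conversion stage, while a 93%-efficient design draws about 156 W and loses only ~11 W, so the converter alone can account for a ~25 W difference in heat dumped into the case near the GPU intakes.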