I’m not sure if this has been covered already but if you are wondering why your CUDA kernels don’t seem to be prodding your Maxwell v2 GPU to its max rated memory clock speed then read this thread.
Kudos to the people on that thread for recognizing that compute applications weren’t achieving the same memory clocks as graphics applications.
In my case, I have an EVGA GTX 980 SC ACX 2.0 that immediately boosts to a GPU/MEM clock of 1392/1502 MHz.
However, the card is rated for a max MEM clock of 1752 MHz, and I had never seen a CUDA kernel boost beyond 1502 MHz.
After reading the above thread, I queried the supported clocks:
nvidia-smi -i <device id> -q -d SUPPORTED_CLOCKS | more
… and set the application clocks to the max supported for this card:
nvidia-smi -i <device id> -ac 3505,1531
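For convenience, the dump-and-identify step can be scripted. This is only a sketch: the here-doc below stands in for real `nvidia-smi -q -d SUPPORTED_CLOCKS` output (abbreviated), and it assumes the listing is sorted highest-first, as it is on the cards I've seen:

```shell
# Abbreviated sample standing in for real output of:
#   nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS
sample='Supported Clocks
    Memory                  : 3505 MHz
        Graphics            : 1531 MHz
        Graphics            : 1518 MHz
    Memory                  : 3304 MHz
        Graphics            : 1304 MHz'

# With a highest-first listing, the first Memory line is the max memory
# clock and the first Graphics line is the max graphics clock supported
# at that memory clock.
mem=$(printf '%s\n' "$sample" | awk '/Memory/   { print $3; exit }')
gfx=$(printf '%s\n' "$sample" | awk '/Graphics/ { print $3; exit }')
echo "nvidia-smi -i 0 -ac ${mem},${gfx}"
```

Running this against the sample prints the exact `-ac` command to apply.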
The results are impressive!
The CUDA Samples “Bandwidth Test” now reports almost 200 GB/s instead of the previous ~160 GB/s.
My HotSort benchmark leapt as well! The purple line shows the impact of the improved mem clock boost.
I wonder why compute kernels default to a lower power state?
Thanks for highlighting the use of application clocks! I have repeatedly recommended them to forum participants over the past couple of years, but a worked example accompanied by a nice graph is worth a thousand words on the topic :-)
The surprise for CUDA devs with GM20X GPUs is that, according to nvidia-smi, the power state for Type “C” (compute) tasks doesn’t default to the maximum supported MEM clock. Type “C+G” (compute+graphics) tasks operate as expected. The previously mentioned thread really digs into it.
It’s always pleasing to get a free and safe performance boost!
Now I can’t wait to hear from CudaaduC whether the use of application clocks enables him to boost his app over the performance threshold he has been approaching (from below) for a while.
I agree the observed auto-boosting behavior doesn’t make intuitive sense; maybe someone from NVIDIA can enlighten us on this issue.
Since I have a reference Titan X in the same PC as an EVGA superclocked Titan X, I just upped the reference Titan X to:
nvidia-smi -i 0 -ac 3300,1304
and that already pushed my multi-gpu 512^3 RabbitCT back projection time down to my target time of 740 ms (including all memory copies both directions).
It was at 220 ms for 256^3, and ~5 seconds for 1024^3, including memory copy times (without this new boost). I will benchmark the new iterations over the next few days.
All this without using bilinear texture interpolation, which all the top finishers use.
Forgive the “dumb” question, but is this technically “overclocking” or is it just using more of the intended capacity of the GPU?
Good question. The application clocks set up by NVIDIA for a particular reference device are obviously sanctioned by them and these GPUs should function flawlessly at the highest application clock settable through nvidia-smi. What NVIDIA does not guarantee when choosing high application clocks is that clock throttling due to exceeding the power or thermal limits for the card will not occur.
From what I have seen, many applications run just fine at the highest application clocks while staying well clear of the throttling limits. That is why I always recommend that people try those application clocks with their apps.
As far as I understand, the GPU default clocks are chosen such that clock throttling should never occur under normal operating conditions, no matter what the application. This is important where hundreds or thousands of GPUs must be run continuously, all at the same performance, as part of a cluster.
What I do not know is whether third-party devices allow the setting of application clocks that exceed the application clocks settable on NVIDIA’s reference devices. Personally, I have always been wary of “super-clocked” devices that run at higher than NVIDIA’s reference frequencies out of the box. My (potentially unjustified) concern is that the third-party vendors use graphics rather than compute applications to qualify the parts at those increased clocks.
So as long as you run your GPUs at the NVIDIA-approved application clocks, you are not overclocking, as the parts are designed and qualified by NVIDIA for those clocks, meaning you are not eating into the design margin of the part. As I explained in a post a while back, the design margin is there to absorb manufacturing variations as well as aging effects that slow down transistors and wires over time.
13.06 W
13.06 W
55.48 W
130.90 W
133.95 W
134.06 W
133.47 W
133.57 W
134.24 W
134.63 W
134.44 W
57.89 W
53.60 W
The documentation for this property is:
"power.draw"
The last measured power draw for the entire board, in watts. Only available if power management is supported. This reading is accurate to within +/- 5 watts.
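The readings above can be produced by polling that property. A sketch, where the live polling command is shown only in a comment and the thread's readings stand in for live output:

```shell
# Live polling would look like (one reading per second):
#   nvidia-smi -i 0 --query-gpu=power.draw --format=csv,noheader -l 1
# The readings from the post above, standing in for live output:
watts='13.06 13.06 55.48 130.90 133.95 134.06 133.47 133.57
       134.24 134.63 134.44 57.89 53.60'

# Summarize peak and mean board power over the run:
peak=$(printf '%s\n' $watts | sort -n | tail -1)
mean=$(printf '%s\n' $watts | awk '{ s += $1; n++ } END { printf "%.2f", s / n }')
echo "peak ${peak} W, mean ${mean} W"
```

The idle readings at the start and end pull the mean well below the ~134 W sustained load.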
@allanmac: Do I understand correctly that you have seen a TITAN X operate at graphics clocks of 1392 MHz in a compute-only application with only the application clock tweaked using nvidia-smi? That would be surprising, as I’ve never seen our TITAN X go above 1202 MHz (sometimes, depending on its mood, it gets stuck at 1177 or 1189) even though I have the application clocks set to 1391,3505.
Also note that without overriding the default fan speed limit of 60% of the max RPM, all non-Tesla cards (including Quadro) will easily end up throttling. AFAIK there is still no solution for headless servers other than running a dummy X server and fixing the fan speed with nvidia-settings. Does anybody have a different experience?
I think this entire thread can be summarized for CUDA devs as:
If you're a gamer then do nothing.
If you don't have a Maxwell2 (GM20x) card then do nothing.
If you're not benchmarking and/or don't care about an extra 16% bandwidth then do nothing.
Otherwise, using the nvidia-smi utility, dump the supported clocks.
Identify the highest supported memory and graphics clocks.
Use the nvidia-smi utility to set the application clocks.
Dump & Identify:
Set (as Administrator):
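Concretely, the two steps might look like this (device 0; the clock values are this GTX 980’s, quoted earlier in the thread — substitute your own card’s highest supported pair):

```shell
# Dump & Identify: list the supported clock pairs and note the
# highest Memory and Graphics values in the listing:
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS | more

# Set (as Administrator / root): apply the highest supported pair:
nvidia-smi -i 0 -ac 3505,1531
```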
AFAIK, setting the Graphics clock to its highest supported value will have no impact since it seems that CUDA apps are already being granted a card-specific maximum clock speed.
Graphics clock speeds are not an issue with Maxwell2 and CUDA – just Memory clock speeds.
It remains unexplained why GM20x Memory clocks are not boosting to the card’s maximum supported Memory clock without this incantation.
I think most of us are assuming that this behavior is either an oversight or by design… and that bumping the card to its manufacturer-specific peak performance for CUDA apps is not a hazardous or unsupported operation.
FWIW, my GTX 980 runs at these supported clocks for hours at a time and, so far, there are no errors or magic smoke! :)
Do you mean that when you run some compute-intensive code the reported sustained SM clock is 1392 MHz?
I’m asking because this is contrary to our experience with GTX TITAN X and 980 Ti cards (and a similar effect has been observed with Kepler TITANs too): the SM clock never goes above 1163-1202 MHz, even when the application clocks are set to 1391 MHz (AFAIR it’s the same on our vanilla TITAN and 980 Ti cards). The stable clock is not entirely consistent either, varying from run to run by 3-5% (between 1163 and 1189 MHz, rarely 1202). This is, of course, without any power/thermal violations.
I’m surprised that you do not mention overheating as an issue with non-Tesla cards. In my experience none of the GeForce (nor the Quadro) cards can sustain their peak boost clock under heavy load (while staying below the power limit) due to the fan behavior. AFAIK this is because the fan is “optimized for acoustics”: fan speed ramps up quite slowly and, in my experience, does not go above 60% rotation as reported by nvidia-smi without manually overriding the fan speed through the graphical tool (nvidia-settings or its command-line interface). However, the override requires an X server to be running, which is not ideal for a headless server. Of course, the NVRAM can be flashed too, but that’s not something I’d encourage.
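For reference, the fan-speed override can be done from the command line. A sketch assuming a single GPU at index 0 and a driver recent enough to expose the `GPUTargetFanSpeed` attribute (older drivers used `GPUCurrentFanSpeed`):

```shell
# Requires a running X server and "Coolbits" enabled in xorg.conf
# (Option "Coolbits" "4"). Check `nvidia-settings -q all` for the
# attribute names your driver actually exposes.
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=90"
```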
Additionally, the aforementioned slight variations in the clock boosting can be problematic for benchmarking, especially with short kernels. Moreover, in my experience, while “Auto Boost” does lead to the SM clock boosting higher than the application clocks set (e.g. the defaults), the jump from the set application-clock value to the max can take seconds, which can also influence benchmark measurements.
All my observations are based on tests using well-cooled (mostly rack-mounted) headless Linux servers running 352.x drivers. Could it be that the Windows driver behaves so differently? I was under the impression that the fan behavior is programmed in the NVRAM.
Interesting. We have run multi-GPU simulations using both the EVGA SC 980 ACX and the EVGA SC Titan X for weeks at a time without issue (Windows 7 x64 OS). We do not overclock beyond the factory settings, so maybe that makes the difference.
When I monitored the temp of the GTX 980 GPUs they never got above 73 degrees Celsius. Have not done the same test using the Titan X GPUs.
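For anyone wanting to reproduce that check, the temperature polling can be scripted. A sketch, with the live command shown in a comment and made-up sample readings (peaking at the 73 C mentioned above) standing in for live output:

```shell
# Live polling (one reading every 5 s):
#   nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits -l 5
# Illustrative sample readings standing in for live output:
temps='61 68 72 73 72'

# Report the hottest reading seen during the run:
peak=$(printf '%s\n' $temps | sort -n | tail -1)
echo "peak: ${peak} C"
```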
Have you had a failure with a Geforce card without overclocking beyond the factory settings?
Plenty of GTX 980 models are sold with extremely high clocks. As I noted above, this particular SKU has a guaranteed boost of 1362.
The much more powerful 980 Ti and Titan X cards aren’t clocked as high. Googling shows the highest air-cooled Titan X you can buy from EVGA boosts to 1216.
So it sounds to me like your cards are working fine. :)
We’re not using factory-overclocked cards nor do we overclock them. I’ve had no failures, but these cards all end up throttling due to overheating. Here’s some data I just gathered to illustrate the issue: https://goo.gl/mW4O4S. I was running the
nbody -benchmark -numbodies=256000
in an infinite loop on a reference TITAN X and did measurements using this command:
nvidia-smi dmon -d 2 -s pucvm
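A captured `dmon` log can also be post-processed to get the average sustained graphics clock. This is a sketch: the sample lines below are illustrative, not real data, and I’m assuming pclk lands in the last column, as it does for the select flags I’ve used — check the `#` header line of your own log:

```shell
# Illustrative dmon-style lines (pclk in the last column); a real log
# would come from e.g.:
#   nvidia-smi dmon -d 2 -s pucvm > dmon.log
log='# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
    0   134    80    99    55     0     0  3505  1113
    0   133    80    99    54     0     0  3505  1113
    0   134    81    99    55     0     0  3505  1101'

# Average the last column, skipping the header lines:
avg=$(printf '%s\n' "$log" | awk '$1 !~ /^#/ { s += $NF; n++ } END { printf "%.0f", s / n }')
echo "avg pclk: ${avg} MHz"
```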
Unfortunately nvidia-smi dmon does not monitor the fan (note to self: should file an RFI), but during the first phase of the plot, until about 670 s, the fan ran as designed with default settings; then I did the fan speed overriding trick (setting it to 90% of the max RPM) and, as expected, the temperature went down and the frequency immediately went back up. As the nbody SDK sample code pushes the card to the max TDP, the stable clock was around 1113 MHz, which is a >10% higher clock with the fan issue fixed.
I’ve just checked with our vanilla GTX 980 too and observed the same: it starts out at ~1252 MHz and, as soon as it hits 80 C, gradually drops down to 1126 MHz.
Could somebody do a similar test on Windows and report back whether they see the same behavior?
Yes, indeed, for some definition of fine. :)
For the above (and previously outlined) reasons, I feel it’s not the safest to suggest to CUDA devs that this is “not a hazardous or unsupported operation.”
[I am a slow typist and wrote the following without the benefit of seeing allanmac’s post above]
That is a pretty comprehensive set of data, it will be interesting to see comparison data from CudaaduC.
What was the approximate ambient air temperature when these measurements were taken? What kind of a case (enclosure) is being used to house the system components? For example, sticking an actively cooled GPU into a 1U pizza box server enclosure would be a BAD IDEA™, air-flow wise. Is there any possibility that air flow to the GPU could be obstructed by other components (including other GPUs) or cabling, or that the ingress air stream for the GPU is pre-heated by other components in the case, such as the CPU and/or the power supply?
For PSUs, I would recommend 80+ Platinum models these days; their high efficiency minimizes waste heat. CPUs require additional DC-DC conversion (as do GPUs). Unfortunately, it seems to be really difficult to assess the efficiency of that conversion, as it is close to impossible to track down that information. I see that some motherboard manufacturers claim use of significantly more efficient designs. In the past, DC-DC conversion modules for CPUs were usually around 80% efficient; from what I gather from recent specification sheets, efficiencies of up to 93% are technically feasible now (but may be pricey). The power consumption of CPUs themselves is well documented, of course; most of it heats the air in the case and may lead to higher GPU air intake temperatures.