Titan V boost-clock issue

Hi,

with the latest driver, 388.71, it seems that there is nearly no boost. Although no limit is reached, the card caps itself at 1355 MHz and runs mainly at 1200 MHz. The behavior is similar with OpenCL and CUDA code. The card is not connected to a screen and sits in a PCIe x8 slot.

Is this a driver issue?

Regards
Dirk

What is the temperature reported by nvidia-smi? Ever since the Pascal architecture, the upper regions of automatically controlled boost clocks appear to be quite sensitive to GPU temperature, with the highest boost clocks only reachable as long as the GPU temperature stays below about 60 degrees Celsius.
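If you want to track temperature, clocks, and power continuously while your code runs, here is a minimal monitoring sketch using the NVML Python bindings. This assumes the pynvml package is installed (e.g. via pip install pynvml); it is only a sketch, not the only way to do this:

# Minimal GPU monitoring sketch using the NVML Python bindings (pynvml).
# Assumption: pynvml is installed and the NVIDIA driver's NVML library
# is available on the system.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU in the system

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
        print("temp=%d C  sm_clock=%d MHz  power=%.1f W"
              % (temp, sm_clock, power_mw / 1000.0))
        time.sleep(1.0)  # sample once per second
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()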

Boost clocks on GPUs are best thought of as a bonus feature. The only thing the vendor guarantees is operation at the base clock (1200 MHz in the case of the Titan V). Everything above that is icing on the cake, subject to environmental parameters such as temperature and power draw and the details of the clock boosting control software. Since manufacturing tolerances play into the environmental factors, there can also be differences between different physical GPUs even if they are the same model.

It is always possible that newer drivers include modifications to the boost clock control software. But without a tightly controlled experiment (identical operating conditions, only driver version changing) it would be hard to assess what the exact nature of those differences is, since we are not privy to the details of the clock control algorithm and its (presumably tuneable) parameters.

Hi,

thanks for the reply. There is a minor misunderstanding here: I did not test the earlier driver, but expected the card to raise the clock when running within all limits. To make the test more reproducible, I used LuxMark instead of my own software. The temperature never exceeds 70 degrees Celsius and stays below 60 degrees for more than a minute, yet the card does not reach the advertised boost clock of 1455 MHz even below 50 degrees. I tested manual overclocking a few minutes ago, which works fine; the card simply does not boost on its own.

Here is a snippet of the nvidia-smi query output while LuxMark is running:

Temperature
    GPU Current Temp          : 47 C
    GPU Shutdown Temp         : 100 C
    GPU Slowdown Temp         : 97 C
    GPU Max Operating Temp    : 91 C
    Memory Current Temp       : 47 C
    Memory Max Operating Temp : 95 C
Power Readings
    Power Management          : Supported
    Power Draw                : 112.42 W
    Power Limit               : 250.00 W
    Default Power Limit       : 250.00 W
    Enforced Power Limit      : 250.00 W
    Min Power Limit           : 100.00 W
    Max Power Limit           : 300.00 W
Clocks
    Graphics                  : 1335 MHz
    SM                        : 1335 MHz
    Memory                    : 850 MHz
    Video                     : 1200 MHz
Applications Clocks
    Graphics                  : 1200 MHz
    Memory                    : 850 MHz
Default Applications Clocks
    Graphics                  : 1200 MHz
    Memory                    : 850 MHz
Max Clocks
    Graphics                  : 1912 MHz
    SM                        : 1912 MHz
    Memory                    : 850 MHz
    Video                     : 1717 MHz
Max Customer Boost Clocks
    Graphics                  : N/A
Clock Policy
    Auto Boost                : N/A
    Auto Boost Default        : N/A
Processes
    Process ID                : 2948
    Type                      : C
    Name                      : E:\Tools\LuxMark-v3.1\luxmark.exe
    Used GPU Memory           : Not available in WDDM driver model

Regards
Dirk

The advertised boost clock is simply a “will not exceed” rate, with no promises that any particular application will ever reach it on any particular physical GPU. We don’t know how the boost clock control works in detail. It may well take sensor data from various points around the GPU and the attached memory, and reduce clocks if any one of them exceeds preset limits, to ensure stable GPU operation.

Published graphs of clocks vs. temperature for graphics applications on Pascal-family GPUs typically show a rapid fall of the boost clock within the first 30 seconds of the application run. You could try repeating your experiment with the fan speed manually forced to 100%, if you have a means of doing so. It is also possible that some sort of power limit, rather than a thermal limit, is being reached; check whether nvidia-smi allows you to dial in a higher power limit for the GPU (i.e. Enforced Power Limit = Max Power Limit).
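For the power-limit part, a minimal sketch using the same NVML Python bindings could look like the following (the command-line equivalent is nvidia-smi -pl <watts>; in either case, raising the limit requires administrator privileges):

# Sketch: query the allowed power-limit range and raise the enforced
# limit to the board maximum, via the NVML Python bindings (pynvml).
# Assumption: pynvml is installed; setting the limit needs admin rights.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Allowed range, reported in milliwatts.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print("allowed power limit: %.0f W .. %.0f W"
      % (min_mw / 1000.0, max_mw / 1000.0))

# Raise the enforced limit to the maximum the board allows.
pynvml.nvmlDeviceSetPowerManagementLimit(handle, max_mw)

pynvml.nvmlShutdown()

In your case that would move the enforced limit from 250 W to the 300 W maximum shown in your output, though at 112 W of measured draw it seems unlikely the global power limit is what is holding the clocks back.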

The Titan V is a brand-new product, so there isn’t yet a large body of published reports one can draw on to compare boost clock behavior across a statistically relevant number of GPUs. It seems weird that your nvidia-smi output shows “Auto Boost: N/A”, for example.

[Later:] Here is a gaming site that looks into Titan V boost clocks. I am not familiar with the site, so consider this link as provided for entertainment value only:

https://www.gamersnexus.net/guides/3171-nvidia-titan-v-power-consumption-thermals-and-clock-behavior

So I did some more testing.

It appears to be known behavior that compute tasks do not enter the highest power state. The card boosts fine with OpenGL tasks, up to 1800 MHz. Unfortunately, the card will be used as a compute device only :-(
The reason why NVIDIA does this is unclear to me.

The portions of a chip heavily exercised by a graphics application are not the same as those heavily exercised by compute applications. For example, graphics apps don’t use shared memory (at least to my knowledge), while compute apps don’t use rasterization units.

Heavy power draw in particular units can lead to local voltage drops, which causes transistors to slow down, which can in turn cause the unit to malfunction (e.g. violation of setup or hold times). Power draw is roughly a linear function of transistor switching speed, so clocks may have to be limited to prevent such voltage drops. This is just one of several possible mechanisms that can drive clock limits in auto boosting regimes.
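For reference, the usual first-order model for dynamic (switching) power is

    P_dyn ≈ α · C · V² · f

where α is the activity factor (the fraction of transistors switching per cycle), C the switched capacitance, V the supply voltage, and f the clock frequency. At a fixed voltage, power grows linearly with f; and since higher clocks typically also require a higher voltage, the overall scaling is superlinear, which gives the boost controller even more reason to back off when a heavily exercised unit gets hot or droops.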

You can observe similar effects in the clock boosting of Intel CPUs (Turbo Boost), which applies different frequency limits depending on which kinds of instructions, and thus functional units, are in use. If AVX-512 comes into play, for example, the clock boost limits are significantly lower.

Again, I would suggest focusing on the guaranteed base frequencies when selecting processors, rather than on boost clocks that may or may not be achievable for a particular app on a particular device. It should be understood that marketing people the world over will latch onto the highest number in sight and run with it, ignoring any attached caveats (as an engineer, I learned this the hard way).