Cooling multiple Titan X: fan speed?

I have 4 Titan X (Pascal) used for compute (training neural nets at 100% utilization for days or weeks).

The server is a (home-built) DIGITS DevBox, in the high-air-flow Corsair 540 case.

The Titans routinely run at 84-86 degrees C, but the fans stay at 50-60% duty cycle. I know this is, technically, within the thermal limit of the cards. But running 4 cards that hot for months on end has me worried - they are not cheap to replace.

How can I set the fans to a more aggressive profile? In a way that persists across reboot?

I hooked up a monitor, which let me use nvidia-settings to set ONE card’s fan to 100%. That dropped that GPU to 65 C, but the others are still hot (the “Enable GPU Fan Settings” option is not available for the others).

I’ve tried the command-line:

#!/bin/bash
nvidia-settings \
  -a "[gpu:0]/GPUFanControlState=1" \
  -a "[fan:0]/GPUCurrentFanSpeed=40" &

but always get:

** (nvidia-settings:32159): WARNING **: Could not open X display
ERROR: The control display is undefined; please run nvidia-settings --help for usage information.

Indeed, as you noticed, GTX cards have poor default cooling/fan behavior.

Quick answer: you need to be running an X server, and if you run this from a terminal that was not started by an X session, you need to tell nvidia-settings which X display to talk to (using the DISPLAY environment variable, e.g. export DISPLAY=:0).
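For example, once an X server is running on the box, something along these lines should work from any shell; a minimal sketch, assuming the server is on display :0 (adjust if yours differs, and you may also need to point XAUTHORITY at that server’s auth file):

export DISPLAY=:0
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUCurrentFanSpeed=80"   # newer drivers call this GPUTargetFanSpeed

Repeat the two -a options for each GPU/fan index ([gpu:1]/[fan:1], etc.) to cover all cards, provided each GPU has an X screen configured (see below).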

However, there is more to this story…

Firstly, the bigger problem is what someone from NVIDIA called “acoustic optimization”: the fan profile of the GTX cards, configured in the vBIOS. This “optimization” happily lets GPUs run hot instead of ramping up the fan; in fact, I have not seen GTX cards running with >60% fan speed by default. I guess gamers either don’t care or use their own fixed fan settings. Annoyingly, this behavior can’t be changed (without flashing the vBIOS) except by overriding the fan speed with a fixed value, and the stupid part is that you can only do that with nvidia-settings, which requires a running X server.

On all of our headless GPU servers we configure X for all GPUs and use init (upstart/systemd) scripts to start it and automatically set Coolbits and the fan speed. Another init script sets up the devices with application clocks (read from config files), persistence mode, etc. This way you set it up once and forget about it.
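Roughly, the one-time part of that setup looks like this; just a sketch, not our exact scripts, and the application clock values below are placeholders (query the supported ones first, and note that application clocks can’t be set on every GeForce board):

# generate an xorg.conf that covers every GPU and enables fan control (Coolbits)
sudo nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=4

# persistence mode, then application clocks (memory,graphics; placeholder values)
sudo nvidia-smi -pm 1
nvidia-smi -q -d SUPPORTED_CLOCKS
sudo nvidia-smi -ac 3505,1392

The init scripts then just start X on that config at boot and run nvidia-settings against it to force the fan speed, which is essentially what the script in the link below does.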

Some of these tips we included in a supplementary material of a recent benchmark paper I co-authored: http://goo.gl/FvkGC7 (admittedly some of that is a bit dated and has been considerably reworked, but it should act as a decent template)

BTW, the temperature possibly affecting the lifetime of the boards is not the only issue with this behavior. Due to throttling you’ll be losing a lot of performance (I’ve seen up to 20% clock throttling), and the higher the temperature the higher the power consumption (even at the same frequency), which makes the problem even worse, so I really recommend cranking up the fans.
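If you want to see this happening on your own cards, you can log temperature, fan speed, SM clock and power draw while a training job runs, e.g.:

nvidia-smi --query-gpu=index,temperature.gpu,fan.speed,clocks.sm,power.draw --format=csv -l 5

and watch the clocks drop once the GPUs settle in the mid-80s C.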

PS: a secondary issue, but on the hardware side the reference-design GTX coolers are not particularly great IMO (at least not compared to some of the aftermarket ones), and frankly that Corsair case may not be optimal in terms of airflow given that it has holes everywhere and lacks a well-separated GPU compartment. Cases like the Supermicro workstations (7047GR-TRF) are far superior.

@pszilard That sounds like good and useful advice.

As someone who operates GPUs in their home office, I do appreciate quiet operation, and I definitely notice, by ear, when F@H drives GPU power use all the way to the power limit. So I can understand why NVIDIA cares about acoustics for consumer cards, just as HDD manufacturers do for consumer HDDs: a good portion of their customers care.

The point about higher operating temperatures driving down the useful lifetime of semiconductor devices (Arrhenius equation) certainly applies and could be important, although I am not sure to what degree. In my experience, when GPUs die of old age, after many years of near-continuous operation, it is typically the on-board memory, not the processor, that goes bad.

I wonder whether the fan-speed operating limitations imposed by the default NVIDIA VBIOS might also have something to do with the lifetime of the fan? In general people tend to get very annoyed when their expensive electronic device becomes inoperable due to the failure of a relatively cheap component like a fan. The fans that have died on me most frequently are those cooling chipsets and PSUs. I have never had a fan on a GPU die on me. What is your experience in this regard, given that you routinely operate the GPU fans at elevated RPMs?

Anyone made it work?

I got this working; thank you very much to @pszilard!

I used his sample script and xorg.conf (from http://goo.gl/FvkGC7), with minor but important tweaks.

To work with Titan X drivers, I had to change two lines in the cool_gpu script:

GPUCurrentFanSpeed is a read-only attribute, needs to be GPUTargetFanSpeed instead:

nvscmd="${set} -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUTargetFanSpeed=80"

I also needed to change the driver search string: the Titan X Pascal driver uses the string “NVIDIA”, not “Nvidia”:

pciid=`lspci | sed -n -e '/VGA compatib.*NVIDIA/s/^\(..\):\(..\).\(.\).*/printf "PCI:%d:%d:%d\\\\\\\\n" 0x\1 0x\2 0x\3;/p'`

And I changed the xorg.conf to look for the drivers in /usr/lib instead of /usr/lib64 (I’m running Ubuntu 14.04):

ModulePath "/usr/lib/xorg/modules"
ModulePath "/usr/lib/xorg/modules/extensions"
ModulePath "/usr/lib/xorg/modules/drivers"
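To make this persist across reboots, something like the following systemd unit should do it; a sketch only, assuming the script was installed as /usr/local/sbin/cool_gpu (adjust the path), and on Ubuntu 14.04, which still uses upstart by default, the equivalent would be an upstart job or an /etc/rc.local entry:

# /etc/systemd/system/cool-gpu.service
[Unit]
Description=Force NVIDIA GPU fan speeds at boot

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/cool_gpu

[Install]
WantedBy=multi-user.target

Enable it once with sudo systemctl enable cool-gpu.service.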

I have the exact same problem with our Titan X (Pascal), which accelerates Machine Learning applications.

Is it a problem to constantly set the fan speed to e.g. 75%? Does this “harm” the fan, e.g. through mechanical wear?

Our CUDA-Workstation runs 24/7, thus the fan is really constantly running at high speeds now.

I had the same issue with the original GTX Titans, and spent ages messing with connecting monitors and running X servers for each card (to be able to control the fans). I also tried reflashing the firmware with different fan profiles.

In the end I got water blocks for all my cards (mostly Aquacomputer parts, and EK radiators).
I’m kind of hesitant to recommend it to anyone else, as there’s a lot that can go wrong (galvanic corrosion, algae, leaks, etc.), and it cost me a pretty penny, but it’s a lot quieter and I don’t need to worry about temps, even in summer without AC.