GTX 460 on Ubuntu: GPU temperature too high, and then the GPU is lost

There are three GPUs in the machine, all GTX 460s, but GPU 0 always runs hotter than the others. With nothing running on any GPU, GPU 0 sits at about 60 °C while the other two are at about 50 °C.
When I run the Caffe MNIST example on GPU 0, its temperature keeps climbing, goes past 103 °C, and then I get "ERROR: GPU is lost". By the way, the fan speed looks normal to me.


Before running MNIST:

Thu Jun 22 10:09:45 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 460     Off  | 0000:04:00.0     N/A |                  N/A |
| 42%   62C    P3    N/A /  N/A |     56MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 460     Off  | 0000:05:00.0     N/A |                  N/A |
| 20%   47C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 460     Off  | 0000:06:00.0     N/A |                  N/A |
| 20%   42C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
+-----------------------------------------------------------------------------+


After running MNIST:

Thu Jun 22 11:05:33 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 460     Off  | 0000:04:00.0     N/A |                  N/A |
|100%   99C    P0    N/A /  N/A |    124MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 460     Off  | 0000:05:00.0     N/A |                  N/A |
| 26%   53C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 460     Off  | 0000:06:00.0     N/A |                  N/A |
| 20%   44C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
+-----------------------------------------------------------------------------+

Thu Jun 22 11:06:11 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 460    ERR!  | ERR!            ERR! |                 ERR! |
|ERR!  ERR!  ERR!   ERR! / ERR! |    124MiB /  1023MiB |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 460     Off  | 0000:05:00.0     N/A |                  N/A |
| 25%   52C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 460     Off  | 0000:06:00.0     N/A |                  N/A |
| 20%   44C   P12    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  ERROR: GPU is lost                                    |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
+-----------------------------------------------------------------------------+

Other information:

mayang@gpu-cluster-5:~$ lspci | grep 'NVIDIA'
04:00.0 VGA compatible controller: NVIDIA Corporation GF104 [GeForce GTX 460] (rev ff)
04:00.1 Audio device: NVIDIA Corporation GF104 High Definition Audio Controller (rev ff)
05:00.0 VGA compatible controller: NVIDIA Corporation GF104 [GeForce GTX 460] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GF104 High Definition Audio Controller (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GF104 [GeForce GTX 460] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GF104 High Definition Audio Controller (rev a1)
mayang@gpu-cluster-5:~$ dmesg | grep 'NVRM'
[ 13.361468] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
[74341.542523] NVRM: GPU at 0000:04:00.0 has fallen off the bus.
[74341.542535] NVRM: A GPU crash dump has been created. If possible, please run
[74341.542535] NVRM: nvidia-bug-report.sh as root to collect this data before
[74341.542535] NVRM: the NVIDIA kernel module is unloaded.


If your GPU reports a temperature, it's not lying, and a reading like yours is a reason to worry. I'd even say that anything above 80 °C is a very bad temperature to sustain.

If I were you, I'd replace the thermal paste and make sure there's good ventilation in your case.
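
To keep an eye on this while a job runs, something like the following could flag a card before it reaches the point where it drops off the bus. This is only a rough sketch: it assumes nvidia-smi's CSV query mode (--query-gpu with --format=csv,noheader,nounits) works with your driver, and on an older card like the GTX 460 some fields may come back as "[Not Supported]", which the parser turns into None. The 80 C threshold is just the figure mentioned above, and the names parse_readings, hot_gpus and poll_once are made up for this example.

```python
import subprocess

WARN_C = 80  # threshold from the advice above; an assumption, adjust to taste

# nvidia-smi CSV query: one "index, temperature, fan speed" line per GPU
QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"]


def parse_readings(csv_text):
    """Parse 'index, temp, fan' lines into (index, temp_c, fan_pct) tuples.

    Fields the driver cannot report (e.g. '[Not Supported]') become None.
    """
    def to_int(field):
        try:
            return int(field)
        except ValueError:
            return None

    readings = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field row
        readings.append(tuple(to_int(p) for p in parts))
    return readings


def hot_gpus(readings, warn_c=WARN_C):
    """Return the indices of GPUs whose temperature is at or above warn_c."""
    return [idx for idx, temp, _fan in readings
            if temp is not None and temp >= warn_c]


def poll_once():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.check_output(QUERY).decode()
    return parse_readings(out)
```

Calling hot_gpus(poll_once()) every few seconds from a loop (with time.sleep in between) and stopping the job when the list is non-empty would at least keep the card from cooking itself while you sort out the cooling.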