Tesla K80 overheating

bclark · July 10, 2015, 2:28pm

We are running an app on a K80 that for the first 2-3 minutes does just fine - but the temperature of one of the GPUs goes up steadily to 90C after 3 minutes and the clock speeds then throttle to between a third to an eighth of what they were. There is only a passive heat sink. Has anyone else overcome this hurdle?
TIA.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 340.32 Driver Version: 340.32 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 91C P0 110W / 149W | 940MiB / 11519MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 63C P0 120W / 149W | 940MiB / 11519MiB | 64% Default |
+-------------------------------+----------------------+----------------------+

bclark · July 10, 2015, 2:38pm

this is how it looks in normal operation before throttling induces overruns:

+------------------------------------------------------+                       
| NVIDIA-SMI 340.32     Driver Version: 340.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:05:00.0     Off |                    0 |
| N/A   75C    P0   118W / 149W |    793MiB / 11519MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   55C    P0   128W / 149W |    793MiB / 11519MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+

droettger · July 10, 2015, 3:04pm

Are you running it inside a certified server configuration?

Please read this thread and the links in them:
[url]https://devtalk.nvidia.com/default/topic/830470/?comment=4522937[/url]

bclark · July 10, 2015, 4:15pm

We have an iHawk GPU Workbench CUDA server built up [including the K80] by Concurrent Computer Corp and we are doing only CUDA, not graphics on the K80.

One of the links mentions:

So “certified” by who?

droettger · July 11, 2015, 1:57pm

Maybe I picked the wrong word. As you saw the first question when K80 server boards are involved on this developer forum is always if the server system was built to support the passive cooling solution, the required monitoring, BIOS, etc.

If you bought a full server system configuration from one vendor and the machine is not behaving like expected, then you should contact the system vendor first to determine if there isn’t any defect involved.

jeremyrutman · February 12, 2017, 5:02pm

I have an ‘encorr. ecc’ problem on my K80 that is preventing its use:

root@x:~# nvidia-smi
Sun Feb 12 11:00:53 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:83:00.0     Off |                    0 |
| N/A   66C    P0   105W / 149W |  10819MiB / 11439MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   53C    P0   145W / 149W |  10819MiB / 11439MiB |     85%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:87:00.0     Off |                    2 |
| N/A   43C    P8    29W / 149W |      2MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:88:00.0     Off |                    0 |
| N/A   54C    P0   151W / 149W |  10819MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14252    C   python solve_jr.py 0                         10815MiB |
|    1      9625    C   python solve_jr.py 1                         10815MiB |
|    3     15555    C   python solve_jr.py 3                         10815MiB |
+-----------------------------------------------------------------------------+

Apparently some sort of errror has occurred on gpu #2, should i turn off ecc , reboot the machine, or what?