nvidia 387.12 breaks power reading in nvidia-smi.

hussam · October 4, 2017, 8:19am

the relevant line says:
| 45%   40C    P0   ERR! /  75W |    162MiB /  4038MiB |      0%      Default |

01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
linux kernel 4.9.52

Downgrading to 384.90 fixes it.
nvidia-bug-report.log.gz (117 KB)

hussam · October 4, 2017, 8:36am

The correct output under 384.90:

| 45%   40C    P0    35W /  75W |    191MiB /  4038MiB |      0%      Default |

sL1pKn07 · October 4, 2017, 6:10pm

works for me with Titan Black

|   0  GeForce GTX TIT...  Off  | 00000000:13:00.0  On |                  N/A |
| 39%   57C    P8    19W / 300W |   1472MiB /  6066MiB |      0%      Default |

hussam · October 4, 2017, 6:19pm

I see. I’ll wait for input from NVIDIA developers. They likely know better because it works for sure on 384.90 and previous drivers.

aplattner · October 9, 2017, 8:22pm

I looked into this a bit to track down the history of what happened. Apparently the power measurement circuitry on this particular GPU board isn’t very accurate, so reporting was disabled in nvidia-smi. However, there was a bug in the way it was disabled that causes it to report “ERR!” instead of “N/A”. A future driver should change the reporting to “N/A”.
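
For anyone scripting around this in the meantime, here is a minimal Python sketch that treats both "ERR!" and "N/A" in the Pwr:Usage/Cap cell as "no reading available". The function name `parse_power_cell` and the regular expression are my own illustration, not part of any NVIDIA tool, and the row layout is assumed to match the rows pasted in this thread.

```python
import re

# Matches the "Pwr:Usage/Cap" cell of an nvidia-smi summary row,
# e.g. "35W /  75W", "ERR! /  75W" or "N/A /  75W".
POWER_CELL = re.compile(r"(\d+W|ERR!|N/A)\s*/\s*(\d+W|ERR!|N/A)")


def parse_power_cell(row):
    """Return (usage_watts, cap_watts); None stands for an unavailable value."""
    match = POWER_CELL.search(row)
    if match is None:
        raise ValueError("no Pwr:Usage/Cap cell found in row")

    def to_watts(token):
        # "ERR!" and "N/A" both mean there is no usable reading.
        return int(token[:-1]) if token.endswith("W") else None

    return to_watts(match.group(1)), to_watts(match.group(2))


# Rows copied from the first two posts of this thread.
good = "| 45%   40C    P0    35W /  75W |    191MiB /  4038MiB |      0%      Default |"
bad = "| 45%   40C    P0   ERR! /  75W |    162MiB /  4038MiB |      0%      Default |"
print(parse_power_cell(good))
print(parse_power_cell(bad))
```

Mapping both strings to `None` means a script behaves the same before and after the driver changes the label.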

hussam · October 9, 2017, 8:38pm

So the 35 and 36 readings I am seeing in 384.90 are not correct? It fluctuates between those two values depending on GPU utilization load.
And yes, the “ERR!” was what scared me.

Apart from that, is this something safe to just ignore or are there any side effects?
Thank you for the reply.

hussam · October 10, 2017, 5:32am

Is the temperature sensor support guaranteed to stay in future driver versions for this card?
It would be disastrous to lose that one as it affects power management.
Sorry for the odd question, but I did pay 203 US dollars (including taxes) to get a correctly imported NVIDIA card, so I would like to know what will continue to work in the long run.
Thank you.

gerhard.hintermayer · October 18, 2017, 8:50am

To me it looks like power limit handling has been broken too. I recently upgraded from 384.90 and since then my miner can’t stress my GTX 1070 more than ~ 102W - no matter how high I set the power limit :-( In the last version of the driver this was definitely possible. (Of course I did not make any changes to the miner config.)

regards

hussam · October 18, 2017, 9:00am

What card manufacturer?

gerhard.hintermayer · October 18, 2017, 9:17am

Gigabyte Geforce GTX1070 Windforce OC

BTW. this is only true for the GTX1070 card, the GTX1050Ti still drains all the power it should (can confirm this from the temperature readings, as the actual power reading shows ERR).

EDIT: should I start a new thread with that topic ?

hussam · October 18, 2017, 9:29am

I don’t know. But if I understand from aplattner’s post that the readings are incorrect, maybe NVIDIA is capping power usage to safe values so the cards don’t blow up?

gerhard.hintermayer · October 18, 2017, 10:43am

These are my readings:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
| 41%   60C    P2   100W / 105W |    605MiB /  8113MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 105...  On   | 00000000:06:00.0 Off |                  N/A |
| 54%   69C    P0   ERR! /  52W |   2295MiB /  4038MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12906      C   ./miner                                      595MiB |
|    1     12852      C   ./ethdcrminer64                             2285MiB |
+-----------------------------------------------------------------------------+

The 1070 uses only 100-102W, no matter which value I enter for the cap. I also have a power meter on the outlet for this machine, and its readings correlate. The card's temperature also does not rise when I set the power limit to e.g. 150W, which it definitely did in previous versions.

Gerhard

hussam · October 18, 2017, 10:56am

So the Cap value isn’t hardcoded by the board manufacturer?

gerhard.hintermayer · October 18, 2017, 11:06am

No, you can set it with

nvidia-smi -pl <desired watts>

in persistence mode
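
If you want to script that, a rough Python sketch follows. It only builds and prints the `nvidia-smi` command lines (`-i` selects the GPU, `-pm 1` enables persistence mode, `-pl` sets the limit in watts). The helper names `power_limit_commands` and `apply_power_limit` are made up for illustration, and actually running the commands normally needs root and a value the board accepts.

```python
import subprocess


def power_limit_commands(gpu_index, watts):
    """Build the nvidia-smi invocations: persistence mode first, then the limit."""
    if watts <= 0:
        raise ValueError("power limit must be positive")
    target = ["nvidia-smi", "-i", str(gpu_index)]
    return [
        target + ["-pm", "1"],         # enable persistence mode
        target + ["-pl", str(watts)],  # requested power limit in watts
    ]


def apply_power_limit(gpu_index, watts):
    """Run the commands; needs an installed NVIDIA driver and usually root."""
    for command in power_limit_commands(gpu_index, watts):
        subprocess.run(command, check=True)


# Only print the commands here; call apply_power_limit() to execute them.
for command in power_limit_commands(0, 110):
    print(" ".join(command))
```

Whether the card then actually draws up to the requested limit is a separate question, which is what this thread is about.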

gerhard.hintermayer · October 18, 2017, 11:46am

must be related to the P2 power state, the GTX1050Ti is in P0, but the GTX1070 is in P2 :-( no way to force that with nvidia-smi :-(

hussam · October 18, 2017, 11:56am

nvidia-smi says P0 on my 1050 ti while nvidia-settings says P2.
Now it says P5 and is stuck there regardless of clock speed according to nvidia-smi -q.
Edit: And now it went back to P0. There seems to be some delay.

gerhard.hintermayer · October 18, 2017, 9:27pm

Just downgraded to 384.90 and all is good again. Both the power usage reading on the GTX1050Ti and the actual power usage/power limit on the GTX1070 are working again.
People that are using their card for mining should definitely stick to 384.90 ! zecminer ~ 370 Sol/s @ 387.12 @ ~ 102W (not more possible) and ~ 405 Sol/s @ 384.90 @ 110W
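
To put those zecminer figures side by side, here is a small Python calculation of hash rate per watt. The numbers are the approximate ones quoted above, and the helper name `sols_per_watt` is just for illustration.

```python
def sols_per_watt(sol_rate, watts):
    """Hash rate divided by power draw."""
    return sol_rate / watts


new_driver = sols_per_watt(370, 102)  # 387.12: ~370 Sol/s at ~102W
old_driver = sols_per_watt(405, 110)  # 384.90: ~405 Sol/s at 110W

print(f"387.12: {new_driver:.2f} Sol/s per W")
print(f"384.90: {old_driver:.2f} Sol/s per W")
print(f"throughput gain on 384.90: {(405 / 370 - 1) * 100:.1f}%")
```

Efficiency per watt comes out nearly the same on both drivers, so the loss in Sol/s on 387.12 tracks the lower power draw.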

BTW power state of 1070 is still P2, but raising power limit also raises power usage. Nvidia must have broken something with the latest driver version :-(

Wed Oct 18 23:16:54 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
| 42%   61C    P2   108W / 110W |    605MiB /  8114MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 105...  On   | 00000000:06:00.0 Off |                  N/A |
| 53%   69C    P0    52W /  52W |   2301MiB /  4038MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32344      C   ./miner                                      595MiB |
|    1     32375      C   ./ethdcrminer64                             2291MiB |
+-----------------------------------------------------------------------------+

hussam · October 19, 2017, 5:29am

I don’t do mining and my load is relatively low so I don’t care for the power cap.

In any case, I can’t downgrade to 384.90 because 387.12 works much better with the Gnome/mutter monitor-manager changes in gnome 3.26.

I would be happy marking this thread as ‘Fixed’ if they simply changed the “Err” and “unknown error” messages to “N/A” assuming there are no other side effects of doing so.

shiina_yndrd · November 10, 2017, 1:40am

I have two 1050ti cards and both show the same error (Pwr:Usage → ERR!).
It only occurs when using 1050ti cards.
Maybe some bugs exist.

I use NVidia driver ver.384.98 and ubuntu16.

hussam · November 10, 2017, 5:20am

I think something is actually wrong with those GPU boards in particular.

nvidia-smi -q | grep "Min Power"
Min Power Limit             : 52.50 W

But the actual power draw reading was always below 52.5
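
A quick way to repeat that comparison without grep is sketched below in Python. It assumes the `nvidia-smi -q` text has already been captured into a string and that the line looks exactly like the one pasted above. The function name `min_power_limits` is hypothetical, and the 35W figure is the 384.90 reading quoted earlier in the thread.

```python
import re

# Matches lines such as "Min Power Limit             : 52.50 W".
LIMIT_LINE = re.compile(r"^\s*Min Power Limit\s*:\s*([\d.]+)\s*W\s*$", re.MULTILINE)


def min_power_limits(query_output):
    """Return every 'Min Power Limit' value (in watts) found in `nvidia-smi -q` text."""
    return [float(value) for value in LIMIT_LINE.findall(query_output)]


sample = "Min Power Limit             : 52.50 W\n"
reported_draw = 35  # watts, the reading seen under 384.90

for limit in min_power_limits(sample):
    print(f"min limit {limit} W, reported draw below it: {reported_draw < limit}")
```

This only automates the check; it does not say whether the limit or the reading is the one that is off.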