Shell script & "nvidia-smi" - needs right command/flag!

Hi

I’m not sure where to place my question. If this is very wrong, please forgive me…

I’ve got a problem regarding a shell-script and the “nvidia-smi” command!

I’ve made a script that acts as protection against CPU overheating on my Ubuntu Server 14.04.2. The script works nicely, but I need to make it work on my 4 GPUs as well.
I’m pretty green when it comes to bash scripts, so I’ve been looking for commands which would make it easy for me to edit the script. I found and tested a lot of them, but none seem to give me the output I need! I’ll show you the commands and their output below, and the script as well.
What I need is a command which lists the GPUs the same way the “sensors” command from “lm-sensors” does, so that I can use “grep” to select a GPU and set the variable “newstr” (the two-digit temp.). I’ve been trying for a couple of days, but have had no luck, mostly because the commands “nvidia-smi -lso” and “nvidia-smi -lsa” don’t exist anymore. I think they were experimental.

Here are the commands I found and tested, with their output:

This command shows the GPU socket number, which I could put into the string “str”, but the problem is that the temp. is on the next line. I’ve been fiddling with grep’s “-A 1” flag but haven’t been able to work it into the script:

# nvidia-smi -q -d temperature | grep GPU
Attached GPUs                       : 4
GPU 0000:01:00.0
        GPU Current Temp            : 57 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:02:00.0
        GPU Current Temp            : 47 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:03:00.0
        GPU Current Temp            : 47 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:04:00.0
        GPU Current Temp            : 48 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A

This command shows the temp in the first line, but there’s no GPU number!?

# nvidia-smi -q -d temperature | grep "GPU Current Temp"
        GPU Current Temp            : 58 C
        GPU Current Temp            : 47 C
        GPU Current Temp            : 47 C
        GPU Current Temp            : 48 C

This command shows the temp. for the GPU you select, but there’s still no output showing the GPU number/socket/ID!?

# nvidia-smi -q --gpu=0 | grep "GPU Current Temp"
GPU Current Temp            : 59 C

And this command shows the GPU number and the GPU info on the same line!! But no temperature!!

# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-9785c7c7-732f-1f51-..........)
GPU 1: GeForce GTX 750 (UUID: GPU-b2b1a4a-4dca-0c7f-..........)
GPU 2: GeForce GTX 750 (UUID: GPU-5e6b8efd-7531-777c-..........)
GPU 3: GeForce GTX 750 Ti (UUID: GPU-5b2b1a2f-3635-2a1c-..........)

And here’s a command which shows the temp. of all 4 GPUs without anything else. But I still need the GPU number/socket/ID!?

# nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
58
47
47
48

Here’s what I’m wishing for! If I could get a command which produced an output like this, I would be the happiest guy around:

GPU 0: GeForce GTX 750 Ti   GPU Current Temp            : 58 C
GPU 1: GeForce GTX 750   GPU Current Temp            : 47 C
GPU 2: GeForce GTX 750   GPU Current Temp            : 47 C
GPU 3: GeForce GTX 750 Ti   GPU Current Temp            : 48 C

Here’s the output of “sensors” from “lm-sensors”. As you can see, the unit info and the temp. are on the same line:

# -----------------------------------------------------------
# coretemp-isa-0000
# Adapter: ISA adapter
# Physical id 0:  +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 0:         +56.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 1:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 2:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
# Core 3:         +52.0°C  (high = +80.0°C, crit = +100.0°C)
# -----------------------------------------------------------

Here’s the part of the script that needs changing. As mentioned at the top, this works using the command “sensors” from the application “lm-sensors”. “lm-sensors” doesn’t show the GPU temp. when running CUDA with the driver attached, so we need another command to get the GPUs listed and the temp. shown. You may know another way to fix my problem; if so, please don’t hesitate to show me:

[...]
echo "JOB RUN AT $(date)"
echo "======================================="

echo ''
echo 'CPU Warning Limit set to => '$1
echo 'CPU Shutdown Limit set to => '$2
echo ''
echo ''

sensors

echo ''
echo ''

for i in 0 1 2 3
do

  str=$(sensors | grep "Core $i:")
  newstr=${str:17:2}

  if [ ${newstr} -ge $1 ]
  then
    echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log
    echo $(date)                                                                        >>/home/......../logs/watchdogcputemp.log
    echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
    echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' $i 'EXCEEDED' $1 '=>' $newstr >>/home/......../logs/watchdogcputemp.log
    echo ' ACTION : EMAIL SENT'                                                         >>/home/......../logs/watchdogcputemp.log
    echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
    echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log

# Status Warning Email Sending Code
# WatchdogCpuTemp Alert! Status Warning - Notifying!"

/usr/bin/msmtp -d --read-recipients </home/......../shellscripts/messages/watchdogcputempwarning.txt

    echo 'Email Sent.....'
  fi
[...]

I hope there’s a bash-script guru out there, ready to solve this issue.
Have a nice weekend!

Kind Regards,
Dan Hansen
Denmark


The user interface for nvidia-smi unfortunately leaves much to be desired.

The linkage between the two sets of information would appear to be the UUID, which is a unique and unambiguous identifier for each GPU. First invoke nvidia-smi -L, which gives the GPU number, name, and UUID. Then invoke nvidia-smi -q and parse the output to find a UUID entry, and from there scan until you hit the corresponding temperature data.

This is probably a job better tackled with a scripting language more powerful than bash, such as Perl. With Perl you could simply index a small hash (associative array) containing the per-GPU data directly with the UUID string.

For a crude approximation to what you want, try this:

nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader

You can access the underlying NVML API directly using the Python bindings found here:

https://pythonhosted.org/nvidia-ml-py/

And here are the NVML docs:

http://docs.nvidia.com/deploy/nvml-api/index.html

I find this quite a bit easier than trying to wrangle nvidia-smi (and more efficient than forking off processes to parse stdout).

Hi Njuffa ;)

Thanks for your reply!!!

I’m pretty new to Linux (1.5 years) and not familiar with Perl. I’ve only used a Perl script once, to correct an error when making an Ubuntu Server installation from a USB stick.

I tried your suggestion and it’s the best so far! It’s almost there! I need some kind of text in front of the GPU ID, such as “GPU” (e.g. “GPU0”), to be able to “grep” the line in my script. Maybe you know a little trick? But it’s the best suggestion so far, that’s for sure ;)

Here’s how it looked:

# nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader
0, GeForce GTX 750 Ti, 52
1, GeForce GTX 750, 45
2, GeForce GTX 750, 49
3, GeForce GTX 750 Ti, 51

Very nice indeed!!
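
One possible way to get that “GPU” prefix is to pipe the CSV output through awk. This is only a sketch: format_gpu_lines is a made-up helper name, and it assumes the three fields are separated by a comma and a space, exactly as in the output above.

```shell
#!/bin/sh
# Turn "index, name, temp" CSV lines into "GPU<index>: <name> : <temp> C".
# format_gpu_lines is a hypothetical helper name, not part of nvidia-smi.
format_gpu_lines() {
  awk -F', ' '{ printf "GPU%s: %s : %s C\n", $1, $2, $3 }'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader \
    | format_gpu_lines
fi
```

Fed the line “0, GeForce GTX 750 Ti, 52”, the helper prints “GPU0: GeForce GTX 750 Ti : 52 C”, so a grep for "^GPU0:" would select that card.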

Looking forward to hearing from you again - and hoping ;)

Hi ScottGray,

Thanks for your reply as well ;)

I’ve studied the 2 links and saved them for further use. But it’s too advanced for me. I’m not that “strong” yet. I tried to figure out the stuff at NVIDIA’s, but I’d better stick with what I know for now. But thanks for trying to help me out ;)

Kind Regards,
Dan

Hi,

The problem seems to be solved for the moment! I got a response on the Ubuntu forum, and one suggestion solved the issue.

Here’s how we did it and how we came to the solution. My thanks to “Terdon”:
http://askubuntu.com/questions/638665/shell-script-nvidia-smi-needs-right-command-flag/641828#641828

For others to see and learn from, here are the results on my Ubuntu Server 14.04.

This one looks like this on my system:

# nvidia-smi -q -d temperature | awk '{if(/C$/){print last,$0};last=$0};'
    Temperature         GPU Current Temp            : 53 C
    Temperature         GPU Current Temp            : 45 C
    Temperature         GPU Current Temp            : 52 C
    Temperature         GPU Current Temp            : 51 C

And this one, which is just PERFECT, looks like this on my system:

# nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU
GPU 0000:01:00.0        GPU Current Temp            : 53 C
GPU 0000:02:00.0        GPU Current Temp            : 45 C
GPU 0000:03:00.0        GPU Current Temp            : 52 C
GPU 0000:04:00.0        GPU Current Temp            : 51 C
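
For anyone who would rather not depend on perl, the same pairing can probably be done in a single awk call. This is a rough sketch, not Terdon’s method: pair_gpu_temps is a made-up name, and it assumes the “nvidia-smi -q -d temperature” layout shown earlier in this thread (a “GPU <bus id>” header line followed by an indented “GPU Current Temp” line).

```shell
#!/bin/sh
# Remember each "GPU <bus id>" header and print it next to the temperature
# that follows it. pair_gpu_temps is a hypothetical helper name.
pair_gpu_temps() {
  awk '
    /^GPU /            { id = $2 }                            # e.g. "GPU 0000:01:00.0"
    /GPU Current Temp/ { print "GPU " id " : " $(NF-1) " C" } # field before the trailing "C"
  '
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q -d temperature | pair_gpu_temps
fi
```

The indented temperature lines do not match "^GPU " because of their leading spaces, so only the header lines update the remembered ID.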

Here I’ve got the GPU text to “grep” in my script, I’ve got the GPU socket ID and, last but not least, I’ve got the temperature on the same line! Exactly what I asked for.
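
For the watchdog loop itself, a variant that skips the fixed-position substring could look roughly like the following. It is only a sketch under assumptions: check_gpu_temps is a made-up name, the warning text is shortened from the CPU script, the warning limit is still passed as $1, and the logging and email steps are left out.

```shell
#!/bin/sh
# Read "index, temperature" CSV lines on stdin and print a warning for every
# GPU at or above the limit given as the first argument.
# check_gpu_temps is a hypothetical helper name.
check_gpu_temps() {
  limit=$1
  while IFS=', ' read -r idx temp; do
    # Skip anything that is not a plain number (e.g. "N/A").
    case "$temp" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$temp" -ge "$limit" ]; then
      echo "STATUS WARNING : TEMPERATURE GPU $idx EXCEEDED $limit => $temp"
    fi
  done
}

# Only run against the driver when nvidia-smi exists and a limit was passed.
if command -v nvidia-smi >/dev/null 2>&1 && [ -n "${1:-}" ]; then
  nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader \
    | check_gpu_temps "$1"
fi
```

Because the temperature is read as a whole field, this keeps working if a value ever has one or three digits, which a fixed offset like ${str:17:2} would not.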