nvidia-smi programmatically

kalman · April 13, 2012, 3:44pm

Hi all,
I need to retrieve programmatically some of the information reported by nvidia-smi.
I would prefer to avoid fork/exec’ing nvidia-smi, or strace’ing it and issuing the same
ioctl calls myself. It seems that struct cudaDeviceProp doesn’t contain everything I need, such as
fan speed, temperature, GPU utilization and so on.

Am I missing something?

mfatica · April 13, 2012, 3:45pm

You should take a look at NVML:
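
As a rough illustration of the kind of query NVML allows, here is a minimal sketch that reads the values asked about (temperature, fan speed, utilization). It is written in Python with ctypes only to keep it short and free of build steps; a C program would make the same calls through nvml.h. Assumptions: the driver's library can be loaded under the name libnvidia-ml.so.1, nvmlDevice_t can be treated as an opaque pointer, and the helper names (load_nvml, read_uint, query_gpus) are made up for this example.

```python
import ctypes

NVML_SUCCESS = 0
NVML_TEMPERATURE_GPU = 0  # nvmlTemperatureSensors_t value for the GPU die sensor


class NvmlUtilization(ctypes.Structure):
    """Mirrors nvmlUtilization_t: two percentages for the last sample period."""
    _fields_ = [("gpu", ctypes.c_uint), ("memory", ctypes.c_uint)]


def load_nvml(name="libnvidia-ml.so.1"):
    """Return a handle to the NVML library, or None if it cannot be loaded.

    The default name is an assumption about how the driver installs it.
    """
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def read_uint(func, *args):
    """Call an NVML getter whose last argument is an unsigned int out-parameter.

    Returns the value, or None if the call reports any error.
    """
    value = ctypes.c_uint(0)
    if func(*args, ctypes.byref(value)) != NVML_SUCCESS:
        return None
    return value.value


def query_gpus(lib):
    """Return one dict per GPU; values a board cannot report are None."""
    if lib is None or lib.nvmlInit() != NVML_SUCCESS:
        return []
    try:
        count = read_uint(lib.nvmlDeviceGetCount) or 0
        results = []
        for index in range(count):
            device = ctypes.c_void_p()  # assumption: nvmlDevice_t is an opaque pointer
            status = lib.nvmlDeviceGetHandleByIndex(
                ctypes.c_uint(index), ctypes.byref(device))
            if status != NVML_SUCCESS:
                continue
            util = NvmlUtilization()
            util_ok = lib.nvmlDeviceGetUtilizationRates(
                device, ctypes.byref(util)) == NVML_SUCCESS
            results.append({
                "index": index,
                "temperature_c": read_uint(
                    lib.nvmlDeviceGetTemperature, device,
                    ctypes.c_int(NVML_TEMPERATURE_GPU)),
                "fan_percent": read_uint(lib.nvmlDeviceGetFanSpeed, device),
                "gpu_util_percent": util.gpu if util_ok else None,
                "mem_util_percent": util.memory if util_ok else None,
            })
        return results
    finally:
        lib.nvmlShutdown()  # pairs with the successful nvmlInit above


if __name__ == "__main__":
    for gpu in query_gpus(load_nvml()):
        print(gpu)
```

On a machine without the driver, load_nvml() returns None and query_gpus() returns an empty list, so the sketch degrades quietly instead of raising.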

kalman · April 13, 2012, 5:53pm

Thank you, exactly what I was looking for.

kalman · May 4, 2012, 3:50pm

I tried to use it today, but I’m getting a driver mismatch.

Calling nvmlInit() gives this error:

Error: API mismatch: the NVIDIA kernel module has version 295.41,
but this NVIDIA driver component has version 295.45. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.

I’m using CUDA 4.2, and the driver proposed on the download page is indeed the one I have installed, 295.41. Looking on the ftp site I found 295.40 and 295.49, but no trace of the 295.45 needed by NVML.

Any chance of getting an NVML version for the current driver 295.41, or at least the driver 295.45 required by NVML?

Robert_Alexander · May 4, 2012, 8:41pm

Hey kalman,

I installed the r295 driver locally on a 64 bit machine, but I wasn’t able to reproduce the error.

[0 ralexander@ralexander-test:~]
$ nvidia-smi
Fri May  4 13:20:33 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41   Driver Version: 295.41         |
|-------------------------------+----------------------+----------------------+
...

Can you run:

$ which nvidia-smi

It seems possible that an earlier version of nvidia-smi is on your path.

Can you also try reinstalling the r295.41 driver?

Thanks,

Robert Alexander

kalman · May 5, 2012, 7:00pm

The problem doesn’t arise when using nvidia-smi. I’m trying to use NVML, and the first thing to do is call nvmlInit() in my own application; the error I’m getting is not related to nvidia-smi (which works fine).

I downloaded tdk_2.295.1_linux.tar.gz from http://developer.nvidia.com/nvidia-management-library-nvml as suggested by mfatica (see post above), and copied the header nvml.h and the library libnvidia-ml.so from the package (no installer?). I was able to compile (well, apart from a comma at the end of an enumerator list in nvml.h) and link correctly, but at runtime I get the driver mismatch error. It seems that libnvidia-ml.so was built against another driver (despite the archive being named tdk_2.295.1).

Investigating further, I ran ldd on nvidia-smi and it turns out that my system already has a libnvidia-ml.so (in /usr/lib). So I guess there was no need to use the library inside tdk_2.295.1_linux.tar.gz; only the header nvml.h is required from that archive (why not distribute it with the CUDA toolkit, then?). After deleting the libnvidia-ml.so copied from the TDK, my application no longer complains (at least when calling nvmlInit).
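
For anyone hitting the same thing, a quick way to spot a stray copy is to list every libnvidia-ml.so* file in the directories the loader might search. The following is only a sketch: the function names are made up for this post, and the default directories (/usr/lib and /usr/lib64) are an assumption that may not match every distribution.

```python
import glob
import os


def find_nvml_copies(directories):
    """List every libnvidia-ml.so* file found directly inside the given directories."""
    found = []
    for directory in directories:
        pattern = os.path.join(directory, "libnvidia-ml.so*")
        found.extend(sorted(glob.glob(pattern)))
    return found


def search_path():
    """Directories to inspect: LD_LIBRARY_PATH entries first, then common defaults."""
    # Assumption: /usr/lib and /usr/lib64 are where the driver puts the library.
    env = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    return env + ["/usr/lib", "/usr/lib64"]


if __name__ == "__main__":
    for path in find_nvml_copies(search_path()):
        # realpath resolves symlinks, so two names for one file are easy to spot
        print(path, "->", os.path.realpath(path))
```

If the same library name shows up in more than one directory and the copies resolve to different files, the one found earlier in the search order is the likely source of a version mismatch.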

While we are at it, running nvidia-smi --help I see the list of supported devices:

Supported products:

Tesla: S1070, S2050, C1060, C2050/70/75, M2050/70/75/90, X2070/90, Gemini

What is Gemini?

Przemyslaw_Zych · May 7, 2012, 11:22am

Hi kalman,

Sorry for the confusion. You should always run with the libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it’s installed in /usr/lib and /usr/lib64.

The libnvidia-ml.so in the TDK package is a stub library that is included only for build purposes (e.g. so that the machine you build your application on doesn’t need to have the Display Driver installed).
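
If you want to confirm at runtime which file was actually picked up, one option on Linux is to look at the process's own memory map after the library has been loaded. This is a hedged sketch rather than anything NVML itself provides: the function names are invented for illustration, and loading by the name libnvidia-ml.so.1 is an assumption about how the driver installs the library.

```python
import ctypes


def mapped_paths(maps_text, fragment):
    """Return the distinct file paths in a /proc/<pid>/maps dump that contain fragment."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and fragment in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def loaded_nvml_paths(name="libnvidia-ml.so.1"):
    """Load NVML by name and report which file(s) the dynamic loader mapped."""
    try:
        handle = ctypes.CDLL(name)  # held in a local so the library stays mapped
    except OSError:
        return []
    with open("/proc/self/maps") as maps:  # Linux only
        return mapped_paths(maps.read(), "libnvidia-ml")


if __name__ == "__main__":
    for path in loaded_nvml_paths():
        print(path)
```

A path outside the driver's install directories suggests that a copied stub is shadowing the real library.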

We’ll add a note to the README to make this clearer.

Since NVML is targeted at Tesla customers and the CUDA Toolkit is big enough without it, we’ve decided to keep it separate.

Regards,
Przemyslaw Zych