Can't detect dGPU utilization through tegrastats

chenghul · June 4, 2018, 6:07pm

Hi,

Right now I’m working on the PX2 platform and would like to check GPU utilization at runtime. From this forum, I know that I can use “sudo tegrastats” to check GPU utilization: “GR3D_FREQ 0%@1275” refers to the iGPU, and “GR3D_PCI 0%@2” refers to the dGPU. But when I measured GPU utilization with tegrastats, I could only see a percentage for the iGPU; the dGPU percentage was always 0. Is there something wrong with my test scenario? These are the commands I used to test.

download NVIDIA_CUDA-9.0_Samples
test iGPU

run "sudo tegrastats" in one terminal
run ./matrixMul -device=1 in the /NVIDIA_CUDA-9.0_Samples/0_Simple/matrixMul folder in another terminal
get the result "RAM 1456/6668MB (lfb 1079x4MB) CPU [0%@1981,61%@2031,42%@2033,0%@1980,0%@1979,1%@1981] EMC_FREQ 13%@1600 GR3D_FREQ 99%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@45C MCPU@45C Tegra@0C Tdiode@50.25C AO@45C GPU@51C BCPU@45C thermal@50.25C Tegra@50.25C Tj@50.25C"

test dGPU

run "sudo tegrastats" in one terminal
run ./matrixMul -device=0 in the /NVIDIA_CUDA-9.0_Samples/0_Simple/matrixMul folder in another terminal
get the result "RAM 1460/6668MB (lfb 1080x4MB) CPU [0%@1982,4%@2031,98%@2030,0%@1981,0%@1980,1%@1980] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@43C MCPU@43C Tegra@0C Tdiode@46C AO@41C GPU@49C BCPU@43C thermal@46C Tegra@46C Tj@46C"
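
For reference, the two GPU fields can be pulled out of a tegrastats line with a few lines of Python. This is only a minimal sketch: it assumes the field layout seen in the output above (`NAME <percent>%@<number>`), the function name `parse_gpu_fields` is hypothetical, and the meaning of the number after `@` is not confirmed here.

```python
import re

# Matches fields such as "GR3D_FREQ 99%@1275" or "GR3D_PCI 0%@2".
# The pattern is inferred from the sample lines in this thread, not from a spec.
_GPU_FIELD = re.compile(r"\b(GR3D_FREQ|GR3D_PCI)\s+(\d+)%@(\d+)")


def parse_gpu_fields(line):
    """Return {field_name: (percent, number_after_at)} for one tegrastats line."""
    return {
        name: (int(percent), int(value))
        for name, percent, value in _GPU_FIELD.findall(line)
    }


if __name__ == "__main__":
    sample = (
        "EMC_FREQ 13%@1600 GR3D_FREQ 99%@1275 APE 245 "
        "MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@45C"
    )
    print(parse_gpu_fields(sample))
```

Lines without either field simply yield an empty dictionary, so the function can be applied to every line of tegrastats output.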

Please help. Thanks!

SivaRamaKrishnaNV · June 5, 2018, 3:26am

Dear chenghul,
We are looking into this issue and will get back to you.

ShaneCCC · July 5, 2018, 2:28am

I have attached a new version of tegrastats here.
Remove the .txt extension to run it.
tegrastats.txt (66 KB)

chenghul · July 23, 2018, 9:54pm

Hi ShaneCCC,

Thanks for your help. But the result is the same when I run the same commands with the new version. Is there anything I missed? Or could you describe how you tested it?

Thanks,
Krammer

ShaneCCC · July 24, 2018, 2:46am

Did you launch tegrastats as superuser (root)?
Please try: sudo ./tegrastats

chenghul · July 24, 2018, 4:53am

Yes, I launched with sudo.

stillrunning · July 24, 2018, 6:55am

Hi ShaneCCC,

I compared your version with the previous one while the dGPU (id=0) was running a Caffe workload.
The results are as follows.

Your tegrastats
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1997,23%@2035,77%@2034,0%@1996,0%@1996,0%@1995] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2573 PLL@43.5C MCPU@43.5C Tegra@0C Tdiode@48.5C AO@43.5C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.5C Tj@48.5C
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1970,80%@2035,20%@2035,0%@1964,0%@1965,1%@1964] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2581 PLL@44C MCPU@44C Tegra@0C Tdiode@48.25C AO@43.5C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.25C Tj@48.25C
RAM 2210/6668MB (lfb 773x4MB) CPU [0%@1950,65%@2015,34%@2018,0%@1947,0%@1948,0%@1949] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2573 PLL@43.5C MCPU@43.5C Tegra@0C Tdiode@48.5C AO@43C GPU@49.5C BCPU@43.5C thermal@48.5C Tegra@48.5C Tj@48.5C

Previous tegrastats
RAM 2213/6668MB (lfb 773x4MB) CPU [0%@1997,0%@2035,0%@2034,0%@1996,0%@1996,0%@1995] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47.25C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.75C Tegra@47.25C Tj@47.25C
RAM 2214/6668MB (lfb 773x4MB) CPU [0%@1998,49%@2034,51%@2035,0%@1996,0%@1997,0%@1996] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 1% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47.25C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.25C Tegra@47.25C Tj@47.25C
RAM 2213/6668MB (lfb 773x4MB) CPU [0%@1966,80%@2034,20%@2035,0%@1966,0%@1964,0%@1968] EMC_FREQ 1%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@42.5C MCPU@42.5C Tegra@0C Tdiode@47C AO@42.5C GPU@48.5C BCPU@42.5C thermal@47.25C Tegra@47C Tj@47C

It seems the only difference is that the new version shows the dGPU memory clock.
Am I right?

Actually, I need to check GPU memory usage, as nvidia-smi does.

AastaLLL · July 24, 2018, 7:32am

Hi,

You can check CUDA memory with cudaMemGetInfo().
Here is an example for your reference:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/post/5168834/#5168834
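
As a rough illustration, cudaMemGetInfo() can also be called from Python through ctypes. This is a sketch under assumptions, not the code from the linked post: the library name `libcudart.so` may need a version suffix on a given system, and `query_cuda_memory` and `format_usage` are hypothetical helper names.

```python
import ctypes


def format_usage(free_bytes, total_bytes):
    """Format free/total byte counts as a used/free/total line in MB."""
    mb = 1024.0 * 1024.0
    used = total_bytes - free_bytes
    return "GPU memory usage: used = %.2f MB, free = %.2f MB, total = %.2f MB" % (
        used / mb, free_bytes / mb, total_bytes / mb)


def query_cuda_memory(device=0, lib_name="libcudart.so"):
    """Return (free_bytes, total_bytes) for a CUDA device, or None on failure.

    lib_name is an assumption; the runtime library may carry a version
    suffix on a given system.
    """
    try:
        cudart = ctypes.CDLL(lib_name)
    except OSError:
        return None  # CUDA runtime library not found
    free = ctypes.c_size_t()
    total = ctypes.c_size_t()
    # Both calls return cudaError_t; 0 means cudaSuccess.
    if cudart.cudaSetDevice(ctypes.c_int(device)) != 0:
        return None
    if cudart.cudaMemGetInfo(ctypes.byref(free), ctypes.byref(total)) != 0:
        return None
    return free.value, total.value


if __name__ == "__main__":
    result = query_cuda_memory(device=0)
    print(format_usage(*result) if result else "CUDA runtime not available")
```

Passing `device=1` instead of `device=0` queries the second CUDA device, if one is present.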

Thanks.

stillrunning · July 25, 2018, 7:42am

Thanks AastaLLL so much.

I’ve done a memory test based on the code that you gave me.
The results from your code and from nvidia-smi are the same on Windows (VS2015, x64) and Ubuntu (GPU server, 14.04, x64).

The code seems to work well for Drive PX2’s GPU 0 too.
When I ran the same deep learning program on the GPU server and on the Drive PX2 Pascal GPU (GPU 0), the GPU memory usage was almost the same (±50MB).

However, for the Parker GPU (GPU 1 on Drive PX2), the results of the code differ from what “sudo tegrastats” reports, both on Drive PX2 (DriveInstall_5.0.5.0bL_SDK_b3) and on Jetson TX2 (JetPack 3.1).

[Your code in Drive PX2 (GPU 1, Parker)]
GPU memory usage: used = 2889.60 MB, free = 3777.98 MB, total = 6667.57 MB

[sudo tegrastats in Drive PX2 (GPU 1, Parker)]
RAM 1783/6668MB (lfb 730x4MB) CPU [1%@1965,0%@2034,0%@2036,0%@1965,0%@1964,0%@1964] EMC_FREQ 0%@1600 GR3D_FREQ 0%@1275 APE 245 MTS fg 0% bg 0% GR3D_PCI 0%@2 PLL@46C MCPU@46C Tegra@0C Tdiode@50.75C AO@46C GPU@52C BCPU@46C thermal@51.5C Tegra@50.75C Tj@50.75C

Could you tell me why?

nunovxax9 · July 25, 2018, 7:52pm

Is there a good way to actually find the percentage of iGPU and dGPU compute usage (like nvidia-settings or nvidia-smi on x86_64)?

stillrunning · July 26, 2018, 9:23am

Hi nunovxax9,

I needed that function too, so I tried several things.
In short, it is not possible right now.

Using nvmlDeviceGetUtilizationRates() from the NVML API, you can get the GPU utilization rate at the code level, but according to the link below, it is not available on the Tegra series.
https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference
(The output of nvmlDeviceGetUtilizationRates() is the same as “Volatile GPU-Util” in “nvidia-smi”.)

There is no way to check the utilization rate of the dGPU until NVIDIA adds support for it.
The iGPU, however, can be checked via the GR3D_FREQ value when you run “sudo tegrastats”.
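
For platforms where NVML is supported, the call looks roughly like the sketch below, written with ctypes. The library name `libnvidia-ml.so.1` and the helper name `query_utilization` are assumptions; on a device where the query is not supported, the function just returns None.

```python
import ctypes

NVML_SUCCESS = 0


class NvmlUtilization(ctypes.Structure):
    """Mirrors nvmlUtilization_t: two unsigned ints, both percentages."""
    _fields_ = [("gpu", ctypes.c_uint), ("memory", ctypes.c_uint)]


def query_utilization(index=0, lib_name="libnvidia-ml.so.1"):
    """Return (gpu_percent, memory_percent), or None if NVML cannot report it."""
    try:
        nvml = ctypes.CDLL(lib_name)
    except OSError:
        return None  # NVML library not present on this system
    if nvml.nvmlInit_v2() != NVML_SUCCESS:
        return None
    try:
        device = ctypes.c_void_p()
        if nvml.nvmlDeviceGetHandleByIndex_v2(
                ctypes.c_uint(index), ctypes.byref(device)) != NVML_SUCCESS:
            return None
        util = NvmlUtilization()
        # A non-zero return code here includes the "not supported" case.
        if nvml.nvmlDeviceGetUtilizationRates(
                device, ctypes.byref(util)) != NVML_SUCCESS:
            return None
        return util.gpu, util.memory
    finally:
        nvml.nvmlShutdown()


if __name__ == "__main__":
    result = query_utilization()
    print(result if result is not None else "utilization not available via NVML")
```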

AastaLLL · July 27, 2018, 6:20am

Hi,

For dGPU, you can get the current clock information via

sudo cat /sys/kernel/debug/gpu_pci/clocks/gpc2clk

And the gpu utilization percentage via

cat /sys/bus/pci/drivers/nvgpu/[dynamic ID]/load

Currently, something is wrong with the ‘load’ node and it always reports 0.
We are checking this internally with the core team and will post an update later.
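
For scripting, the ‘load’ node can be read from Python along these lines. This is a sketch only: it assumes the driver directory quoted above, globs over whatever device IDs are present, and returns the raw integer as-is because the unit of the value is not stated in this thread. `read_dgpu_load` is a hypothetical name.

```python
import glob
import os

# Driver directory quoted in the post above; device IDs below it are dynamic.
NVGPU_DRIVER_DIR = "/sys/bus/pci/drivers/nvgpu"


def read_dgpu_load(driver_dir=NVGPU_DRIVER_DIR):
    """Return {pci_id: raw_load_value} for every device exposing a 'load' node."""
    loads = {}
    for path in sorted(glob.glob(os.path.join(driver_dir, "*", "load"))):
        pci_id = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as node:
                loads[pci_id] = int(node.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric node: skip it
    return loads


if __name__ == "__main__":
    print(read_dgpu_load() or "no nvgpu 'load' node found")
```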

Thanks.

stillrunning · July 27, 2018, 7:49am

Thank you AastaLLL,
I hope that this issue will be resolved soon.
Have a good day.

christoph.doerr · September 13, 2018, 8:02am

Hi,
Have you solved that problem, and is there a way to monitor the GPU load dynamically from Python?
Thank you!

AastaLLL · September 14, 2018, 3:05am

Hi,

Thanks for your patience.
We are still working on this issue.

We will post an update once we have more information.
Thanks.

dariusz.filipski · January 11, 2019, 12:12pm

Any update, please? I tried running matmul of large matrices through TensorFlow and I get 0% load on both:

sudo ./tegrastats - on version linked in https://devtalk.nvidia.com/default/topic/1036238/general/can-t-detect-the-dgpu-utilization-through-tegrastat/post/5269344/#5269344 : GR3D_PCI 0%@2581
watch -n 0.5 cat /sys/bus/pci/drivers/nvgpu/0000\:04\:00.0/load
Every 0.5s: cat /sys/bus/pci/drivers/nvgpu/0000:04:00.0/load Fri Jan 11 13:08:56 2019

0

Tegrastats shows 99% when running the same code on iGPU, so there’s definitely something wrong with reporting the load from dGPU
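
The `watch` check above can be turned into a small Python sampler that records several readings while a workload runs. This is a minimal sketch: the path is the one from the command above and will differ on other systems, and `sample_node` is a hypothetical helper.

```python
import time


def sample_node(path, samples=10, interval=0.5):
    """Read an integer from `path` `samples` times; return the list of values."""
    values = []
    for i in range(samples):
        with open(path) as node:
            values.append(int(node.read().strip()))
        if i < samples - 1:
            time.sleep(interval)  # wait between reads, but not after the last
    return values


if __name__ == "__main__":
    # Path taken from the watch command above; the PCI ID differs per system.
    node_path = "/sys/bus/pci/drivers/nvgpu/0000:04:00.0/load"
    try:
        readings = sample_node(node_path)
        print("min=%d max=%d" % (min(readings), max(readings)))
    except OSError as err:
        print("cannot read %s: %s" % (node_path, err))
```

A maximum of 0 across all samples during a heavy workload would reproduce the problem described here.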

VickNV · January 17, 2019, 7:12am

A fix for the tegrastats issue will be included in the upcoming Drive OS release. Thanks!

dariusz.filipski · January 17, 2019, 10:12am

That’s great news, thanks! Is there an ETA for the release?

Can one update just tegrastats (and any necessary dependencies) on an existing Drive PX 2 installation? I’m not keen on reflashing the device again; we have all the tools and environment set up there and it takes quite some time to rebuild it :(

Are you even going to release a new Drive OS for Drive PX 2?