Object detection performance on Jetson TX2 slower than expected

Hey Developers,

I am currently running several object detection APIs on the Jetson TX2 to figure out which ones are capable of real-time detection.

Two examples are Google's Object Detection API with TensorFlow (https://github.com/tensorflow/models/tree/master/research/object_detection), which I changed a little bit to run as a Python script with the onboard camera or a webcam as input, and YOLO on Darknet (YOLO: Real-Time Object Detection).

I sped up the Jetson with:

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

and my performance numbers are:
TensorFlow with SSD Mobilenet: 4 FPS
Darknet with Tiny YOLO: 17.5 FPS
Darknet with YOLOv2: 2.7 FPS

tegrastats gives me:

RAM 4393/7851MB (lfb 356x4MB) CPU [43%@2035,25%@2035,15%@2035,38%@2035,40%@2035,40%@2035] BCPU@35C MCPU@35C GPU@41C PLL@35C AO@35.5C Tboard@28C Tdiode@34.5C PMIC@100C thermal@34.7C VDD_IN 12282/12517 VDD_CPU 2059/2064 VDD_GPU 4727/4803 VDD_SOC 1601/1595 VDD_WIFI 0/69 VDD_DDR 2812/2808

TensorFlow gives me:

name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 2.00GiB
2017-12-20 10:16:28.963403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

The “freeMemory” value varies, but it never goes above about 4GiB. What does that value mean? Why is it so small? How can I free more memory and assign it to the object detection task?

Those FPS numbers are not terribly slow, but they are far from fast. So how is it possible that the Jetson is used in autonomous cars? I was expecting much more speed. My Dell laptop with an NVIDIA GTX 1050 is twice as fast in these test scenarios.

So am I doing something wrong? How can I increase performance in terms of FPS?

Thank you in advance!

Hi,

Do you run TensorFlow with config.gpu_options.per_process_gpu_memory_fraction = xx?
This configuration limits the amount of GPU memory TensorFlow is allowed to allocate. You can get more information here:
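In case it helps, here is a minimal sketch of how that option is set with the TensorFlow 1.x API used in this thread (the 0.5 fraction is just an example value):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Cap TensorFlow at roughly half of the GPU memory instead of letting
# it reserve (almost) all of it at startup.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # example value

with tf.Session(config=config) as sess:
    # build and run the detection graph here
    pass
```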

Here are two suggestions for object detection samples:
1. DetectNet with jetson-inference:
https://github.com/dusty-nv/jetson-inference#locating-object-coordinates-using-detectnet
2. Backend sample in Tegra Multimedia API

Thanks.

I am in the same situation: running TensorFlow inference with the ssd_mobilenet_v1 model provided by Google, I only get 4 FPS on video. Does anyone have an idea how to improve the inference speed?

@D_pz I am currently working on the Jetson TX2 with Google's Object Detection API.
I created a GitHub repo to work with it.
It should work for you too. It would be nice if you tried it out or contributed!

@AastaLLL
No, I don’t run TensorFlow with this config. Where should it be included?

I ran the TensorFlow Object Detection API and got the following output from

sudo ./tegrastats

:

RAM 7565/7851MB (lfb 5x4MB) CPU [46%@2025,20%@2035,12%@2034,44%@2029,45%@2031,45%@2028] EMC_FREQ 5%@1866 GR3D_FREQ 6%@1300 APE 150 MTS fg 0% bg 0% BCPU@34.5C MCPU@34.5C GPU@40.5C PLL@34.5C AO@32C Tboard@29C Tdiode@32.25C PMIC@100C thermal@33.7C VDD_IN 6342/4735 VDD_CPU 2063/1405 VDD_GPU 1069/368 VDD_SOC 992/934 VDD_WIFI 19/42 VDD_DDR 1514/1316

It seems that almost the whole RAM is used, which is good. But the CPU usage is only between around 10 and 50%,

and the biggest problem is: the GPU usage is only at 6%.

Do you know how I can increase the GPU usage?
I think this is why I only get around 5 FPS when detecting objects with SSD Mobilenet.

The problem with DetectNet is that it’s not really intended for multiple objects. Sure, you can do 2 or maybe 3, but I haven’t seen anything past that.

If a person needs to pick an object detection network to detect multiple objects with a Jetson TX2, what should they pick?

Assuming they want something reasonably fast (approx. 15 FPS) at a reasonable resolution (640x480)?

I don’t see anything within the NVIDIA DIGITS → Jetson TX2 workflow that’s really meant for it.

In the list of things to try there are SSD and Faster R-CNN. But neither of those has been shown to run faster than 5 FPS on the TX2, at least to my knowledge.

There is YOLO, but my understanding is that with it one gives up accuracy.

Hi all,

Here are some suggestions:

1. We recommend that TensorFlow users use our TensorRT to fully utilize the hardware resources:

2. We have a tutorial for multi-class detection with DetectNet:
GitHub - dusty-nv/jetson-inference: Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.

Thanks.

Here is an update about the Object Detection API of TensorFlow:

From this comment:
Very slow Postprocessing in Object Detection API · Issue #2710 · tensorflow/models · GitHub
Some layers in the Object Detection API still run on the CPU, which explains why the performance is not good on Jetson.

Thanks.

This is interesting, thank you AastaLLL for investigating.
But to my understanding this can’t be the only reason, because I updated the config of my tf.Session() to allow GPU memory growth.

While the performance stays the same, the model only uses around 300MB of RAM, and the GPU and CPU usage is still at the same level as before.
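For reference, the growth option mentioned above is set like this in the TensorFlow 1.x API (a sketch; the rest of the session setup is unchanged):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Allocate GPU memory on demand instead of reserving a fixed amount
# up front; total usage then reflects what the model actually needs.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # run the object detection graph here
    pass
```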

This is what makes me wonder: neither the GPU memory, nor the GPU frequency, nor the CPU is maxed out at any time.

So where is the bottleneck? Why doesn’t the Jetson just use more of its power?

Any ideas on that?

Just saw your reply… I will try it, thanks a lot.

Hi,
We found that the performance issue comes from a TensorFlow operation called tf.where.

This is a control flow operation and has poor performance on GPU.

We are checking whether there is any available workaround to improve this.
We will update you once we have more information.

Thanks.

Here are some updates:

Performance becomes better if we put the CNN on the GPU and the MAP on the CPU.
It takes around 70ms on the TX2 with maximized frequency.
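A rough sketch of the kind of device placement this involves; the ops below are placeholders for illustration, not the actual nodes of the detection graph (TensorFlow 1.x API):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Hypothetical split: convolutional backbone on the GPU, control-flow
# heavy postprocessing (e.g. tf.where) pinned to the CPU.
with tf.device('/device:GPU:0'):
    images = tf.placeholder(tf.float32, [None, 300, 300, 3])
    features = tf.layers.conv2d(images, 32, 3)  # stands in for the CNN

with tf.device('/device:CPU:0'):
    scores = tf.reduce_mean(features, axis=[1, 2, 3])
    keep = tf.where(scores > 0.5)  # stands in for the postprocessing
```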

Thanks.

That sounds good, AastaLLL. Could you provide any details on how this is achieved, i.e. how to put the MAP on the CPU instead of the GPU?

It would be nice if you shared how to achieve this, @AastaLLL!

Hi,

We are preparing the script to share with you.
In short, we modify the .pb parser and create two networks: one for GPU and the other for CPU.

Thanks.

Hi AastaLLL, can you please let us know if the script is ready?

Check this issue:

Thanks.

Thanks, @AastaLLL, that worked out!

Hi everyone, does anyone know how to increase YOLO FPS on the TX2? When I ran YOLOv2 on my laptop I achieved about 25 FPS, but on the TX2 I can only achieve 6-7 FPS. Can anyone explain why there is such a big difference?

What are the GPU frequency, the number of CUDA cores, and the architecture of the GPU in your laptop?
How does that compare to the TX2 specs?
Also note that the TX2 is aimed at about 12 watts total across CPU + GPU (give or take), which is probably much less than your laptop uses.
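As a back-of-the-envelope comparison, assuming the laptop GPU is a GTX 1050 with 640 CUDA cores at roughly 1.35 GHz, versus the TX2's 256 CUDA cores at the 1.3 GHz shown in the tegrastats output above (both Pascal-generation parts, so cores times clock is a crude throughput proxy):

```python
# Crude raw-throughput proxy: CUDA cores * clock in GHz.
# Clock figures are approximate; memory bandwidth and the TX2's
# power budget widen the real-world gap further.
gtx1050 = 640 * 1.35  # assumed laptop GTX 1050
tx2 = 256 * 1.30      # Jetson TX2 integrated GPU

print(round(gtx1050 / tx2, 1))  # about 2.6x in the laptop's favor
```

That rough 2.6x compute gap alone is in line with the laptop being about twice as fast in these tests.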