We’re happy to share the following project on GitHub which demonstrates object detection and image classification workflows using TensorRT integration in TensorFlow (for details on TF-TRT integration see this blog post). With this project you can easily accelerate popular models like SSD Inception V2 for use on Jetson.
By following the steps outlined in this project, you will:
Download pretrained object detection and image classification models sourced from the TensorFlow models repository
Run scripts to preprocess the TensorFlow graphs for best utilization of TensorRT and Jetson
Accelerate models using TensorRT integration in TensorFlow
Execute models with the TensorFlow Python API
The models are sourced from the TensorFlow models repository, so it is possible to train the models for custom tasks using the steps detailed there. Provided you use one of the listed model architectures, you can follow the steps above to easily accelerate the model for ideal performance on Jetson.
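For reference, the end-to-end workflow looks roughly like the sketch below (based on the project README; the model name, precision mode, and workspace size are illustrative, and the exact arguments may differ between repo versions):

import tensorflow.contrib.tensorrt as trt
from tf_trt_models.detection import download_detection_model, build_detection_graph

# Download a pretrained model from the TensorFlow models repository
config_path, checkpoint_path = download_detection_model('ssd_inception_v2_coco')

# Preprocess the TensorFlow graph and freeze it for inference
frozen_graph, input_names, output_names = build_detection_graph(
    config=config_path,
    checkpoint=checkpoint_path
)

# Accelerate with TensorRT integration in TensorFlow (TF-TRT)
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)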
It’s easy to use! Thank you!
I tried it on a PC. I probably made a mistake with the labeling.
My webcam shows a car, but the label says bicycle.
I tried ssd_inception_v2_coco_2017_11_17 and mscoco_label_map.pbtxt.
It shouldn’t be related to TensorRT, but in this case it seems the neural network output is 0-indexed, while the label map is 1-indexed. You should be able to add +1 to each output index of the network before associating with the label map to get the correct label.
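A minimal sketch of the fix, assuming classes, scores, and num_detections come from sess.run and category_index is the dictionary built from mscoco_label_map.pbtxt (the variable names are illustrative):

# Shift the 0-indexed network output to the 1-indexed label map before lookup
for i in range(int(num_detections[0])):
    class_id = int(classes[0][i]) + 1
    label = category_index[class_id]['name']
    print(label, scores[0][i])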
Thank you for the clear explanation and benchmarking on this website, and for testing out different models, it is really appreciated!
According to your execution time table, I should get 54.4ms when running ssd_inception_v2_coco on the TX2. Over 200 runs, after the network is ‘warmed up’, I get 69.63ms. This seems like a significant difference to me. When looking at tegrastats, it seems that the GPU is not utilized very efficiently (even though it varies over time, it is rarely even close to 90%):
I just followed all the steps in the GitHub README and the notebook, so any idea what could be the cause of this? I use JetPack 3.3 and TensorFlow 1.10.
We collected the benchmark timings under the following configuration:
(1) JetPack 3.2
(2) TensorFlow 1.8
(3) MAXN power mode (sudo nvpmodel -m0 )
(4) Jetson clocks enabled (sudo ~/jetson_clocks.sh)
(5) Runtime averaged over 50 calls to sess.run(…) on a static image. This excludes reading from disk and JPEG decoding.
First, if (3)-(5) were different from our configuration when you profiled, this would cause a difference in the timing.
If they are consistent with our profiling, then perhaps it is a performance regression from JetPack 3.2 → 3.3, or TensorFlow 1.8 → 1.10, which we would want to investigate.
Thanks for the response! I have indeed run the nvpmodel -m0 command and jetson_clocks.sh, so (3) and (4) are the same. And just so there is no doubt about it, here is the code I used to make sure (5) is comparable:
from time import time
import numpy as np

# Warm-up call (excluded from the timing)
scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})

# Average over 200 calls to sess.run() on the same static image
times = []
for i in range(200):
    t0 = time()
    scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})
    times.append(time() - t0)
print(np.mean(times))
So I would say my setup is comparable. Two other things I noticed:
Running graphdef.ParseFromString() on the frozen graph (generated with build_detection_graph) takes 4.7 seconds. Loading the trt_graph generated by trt.create_inference_graph takes 9 minutes and 26 seconds (!). The same goes for running tf.import_graph_def(graphdef, name='') on both files: 12.9 seconds for the frozen graph, 41.8 seconds for the trt_graph (see the sketch after the next point for the loading steps I'm timing). Is this anywhere near the expected times? It seems ridiculously long to me and could be indicative of something not working right with these versions of JetPack and TensorFlow.
tegrastats reports near-constant 90-100% GPU usage when running the frozen graph (which runs at a speed comparable to what you reported: 139ms vs your reported 132ms for ssd_inception_v2_coco 300x300)
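For reference, the loading I'm timing corresponds roughly to this sketch (the file name is just illustrative):

import tensorflow as tf

# Deserialize the (TF-TRT optimized) graph from disk
graph_def = tf.GraphDef()
with open('trt_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())   # 9 min 26 s for the trt_graph on JetPack 3.3 / TF 1.10

# Import the graph into the default TensorFlow graph
tf.import_graph_def(graph_def, name='')   # 41.8 s for the trt_graph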
I’ll see if I can get JetPack 3.2 with TensorFlow 1.8 installed and check whether I can reproduce your speeds that way, to make sure nothing else is going wrong.
JetPack 3.2 with TensorFlow 1.8 is a little faster, but still not as fast as reported (note that this is a different TX2 module). With the same setup as before, I now get an average runtime of 64.72ms: 5ms quicker than with TensorFlow 1.10 and JetPack 3.3, but still 10ms short of your measured time. Is there something I’m still missing here?
Running ParseFromString to load the trt_graph now only takes 4.59 seconds, so that bug is gone at least.
The GPU usage still seems to be suboptimal, but maybe that is inherent in the model / working with a batch of 1. This is what tegrastats reports with --interval 100:
Sure! On the JetPack 3.2 setup, now with the official TensorFlow 1.9, I get about the same running time I got earlier with the JetPack 3.3 and TensorFlow 1.10 setup: an average of 69.65ms per image for ssd_inception_v2. I realize I should maybe have mentioned this last time, but this is the log for the creation of the inference graph (with the official TF 1.9):
2018-09-10 09:01:07.543057: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2018-09-10 09:01:24.474770: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:438] MULTIPLE tensorrt candidate conversion: 7
2018-09-10 09:01:25.042883: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.043030: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.048693: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.048845: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 812 nodes)
2018-09-10 09:01:25.883711: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:2 due to: "Invalid argument: Output node 'FeatureExtractor/InceptionV2/InceptionV2/Mixed_3b/concat-4-LayoutOptimizer' is weights not tensor" SKIPPING......( 844 nodes)
2018-09-10 09:01:25.890138: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Operation: GatherV2 does not support tensor input as indices, at: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/FilterGreaterThan_83/Gather/GatherV2" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.894630: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.894759: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:4 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 180 nodes)
2018-09-10 09:01:25.898671: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.898789: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:5 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 93 nodes)
2018-09-10 09:01:25.902308: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.902452: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:6 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
Do you also want me to check the log or performance with Jetpack 3.3 or with TF 1.10/1.8?
I wrote a blog post about my experience using the NVIDIA-Jetson/tf_trt_models code. I also shared a script about how to do real-time object detection with various cameras or file inputs. Feel free to check it out. Do let me know if you have suggestions about the code. I’ll update my blog post and my GitHub repo as needed.
As for the performance discrepancy and low GPU utilization: this may have to do with how the object detection post-processing pipeline is configured.
It seems that the default box score threshold for the non-maximum suppression stage is 1e-8, which essentially treats every box as a detection. This may result in unnecessary box-to-box comparisons and a heavier CPU load. This parameter may be found here
I believe the benchmarks in tf_trt_models were collected using a threshold of 0.3. Are your models using a very low threshold? If so could you try raising this to something larger (say above 0.1) and report the performance?
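For example, if your copy of build_detection_graph exposes a score_threshold argument, the graph can be rebuilt with something like the sketch below (otherwise the score_threshold value can be edited directly in the model's pipeline config before freezing):

from tf_trt_models.detection import build_detection_graph

# Rebuild the frozen graph with a higher NMS score threshold (0.3 instead of 1e-8)
frozen_graph, input_names, output_names = build_detection_graph(
    config=config_path,
    checkpoint=checkpoint_path,
    score_threshold=0.3
)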
Wow, that matters a lot! With a threshold of 0.3, I get a running time of 41.3ms using TensorFlow 1.10 and TensorRT 4 for the ssd_inception_v2 model, which is a lot faster than your reported time (maybe because I use a different image, so the NMS has even fewer boxes to compare?). Anyway, thanks, I consider this solved :)
With the official TensorFlow 1.9 I get 113ms now; I don’t really know what’s wrong, but it seems the graph optimization doesn’t work at all now. It doesn’t really matter; it's probably just some conflicting versions of TensorRT and TensorFlow on my side…
I used that wheel too. I use JetPack 3.2.1 now, but according to the JetPack website, JetPack 3.2.1 is the same as JetPack 3.3 apart from the newer CUDA and cuDNN versions, of which I use the JetPack 3.3 versions (tensorrt_4.0.2.0-1+cuda9.0_arm64.deb and libcudnn7-dev_7.1.5.14-1+cuda9.0_arm64.deb, which I think is TensorRT 4.0 GA).
Weird that it doesn’t work for you. Installing the TensorFlow 1.10 wheel was a bit of a hassle for me: I couldn't compile h5py (a dependency of Keras, which in turn is a dependency of TensorFlow), so I skipped it using pip --no-deps. And as reported earlier (and the same as you report on your blog), parsing a network from a string takes ~10 minutes with this version.
Thanks for your blog btw, I enjoy your clearly written articles, they have helped me much in the past!
Today I fell back to JetPack-3.2.1 (TensorRT 3.0 GA) and tested my scripts against the TensorFlow 1.8.0 wheel (Box) as specified in tf_trt_models/README.md at master · NVIDIA-AI-IOT/tf_trt_models · GitHub. And it indeed worked better! After setting score_threshold to 0.3, I was able to get ssd_mobilenet_v1_coco to do real-time object detection at ~20fps, just as advertised by NVIDIA. In addition, the TRT optimization process ran much faster (it only took 1-2 minutes) under this configuration.
I’m going to experiment more and try finding a way to make it work equally well on JetPack-3.3.
Otherwise, it’d be ideal if NVIDIA people could re-build the tensorflow wheels and verify tf_trt_models code against JetPack-3.3.
I confirmed that the slowness on the Jetson TX2 (loading SSD models, optimizing the model with TensorRT, loading the optimized graph, etc.) has a lot to do with the version of TensorFlow. My guess is that some recent changes in TensorFlow do not work that well on the aarch64 architecture.
Based on my testing, TF-TRT works great with TensorFlow 1.8.0. I’ve tested it on both JetPack-3.2.1 and JetPack-3.3.
For more details, please read my blog post and the README.md in my GitHub repo.