TensorRT 5 docs and examples (Solved)

Is there any documentation available yet for TensorRT 5?

The JetPack-4.0-Developer-Preview doc refers to new Caffe SSD and YOLO samples, but they don’t appear to be in /usr/src/tensorrt/samples or on the Deep Learning SDK documentation page. I can’t find any release notes for TensorRT 5 either.

The JetPack-4.0 doc also mentions the ability to execute on either iGPU or DLA using TensorRT 5. Are there any examples available that demonstrate this?

The TensorRT docs are located in /usr/share/doc/tensorrt
The TensorRT samples are located at /usr/src/tensorrt

The UFF SSD sample is included, and for GPU/DLA execution, trtexec and several other samples have been updated with DLA support.
The dev branch of jetson-inference also contains DLA support.
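
For reference, the dev branch can be cloned directly (assuming the usual GitHub location of the repo):

# clone the dev branch (--recursive pulls in any submodules the build may need)
git clone --recursive -b dev https://github.com/dusty-nv/jetson-inference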

The release notes for TensorRT are included in the JetPack Release Notes.

Yes, I had already found the TensorRT folder at /usr/share/doc/tensorrt; however, I don’t see any actual documentation files. The html directory is empty except for another empty “search” directory. There’s a changelog.Debian.gz file, but it won’t extract with the usual tar command. What is supposed to be in that folder? Maybe that folder got corrupted somehow on my Xavier?
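
(Side note for anyone else hitting this: changelog.Debian.gz is a plain gzip file rather than a tarball, so it opens with gunzip/zcat instead of tar, e.g.:

# view the compressed changelog without extracting it
zcat /usr/share/doc/tensorrt/changelog.Debian.gz | less

That only shows the Debian changelog, though, not the missing developer guide.)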

I had also seen the TensorRT samples at /usr/src/tensorrt/samples without realizing the samples had been updated for TensorRT 5 and DLA support. Thanks for pointing that out, and for the jetson-inference support for DLA. I’ll have a look at those…

Hmm, on your host PC where you downloaded JetPack, go into the jetpack_download/ directory and extract the package libinfer-dev_5.0.0-1+cuda10.0_arm64.deb (it extracts from the GUI just like a zip archive would).

Then extract data.tar.xz, and the TensorRT Developer Guide will be there under /usr/share/doc/tensorrt.
Chapter 6 of the TensorRT Developer Guide covers DLA and will let you know what to look for in the source examples.
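
As a command-line alternative to the GUI extraction, dpkg-deb should handle both steps in one go (a sketch, run from jetpack_download/ on the host):

# extract the .deb contents (including data.tar.xz) into ./trt-doc
dpkg-deb -x libinfer-dev_5.0.0-1+cuda10.0_arm64.deb ./trt-doc
# the Developer Guide then lands under ./trt-doc/usr/share/doc/tensorrt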

Nice, I now have the TensorRT 5 Developer Guide. Between that and the jetson-inference dev repo, it ought to be enough to get me started with the DLA…

On page 37 of the developer guide there is an example of running an AlexNet model on the DLA, but that AlexNet model isn’t in the data directory.

Did I miss some step? It’s looking for AlexNet/AlexNet_N2.prototxt

Same problem here; after installation, /usr/share/doc/tensorrt is pretty much empty:

find /usr/share/doc/tensorrt

/usr/share/doc/tensorrt
/usr/share/doc/tensorrt/graphics
/usr/share/doc/tensorrt/changelog.Debian.gz
/usr/share/doc/tensorrt/common
/usr/share/doc/tensorrt/common/graphics
/usr/share/doc/tensorrt/common/scripts
/usr/share/doc/tensorrt/common/scripts/google-analytics
/usr/share/doc/tensorrt/common/scripts/tynt
/usr/share/doc/tensorrt/common/formatting
/usr/share/doc/tensorrt/copyright
/usr/share/doc/tensorrt/html
/usr/share/doc/tensorrt/html/search

Sorry about that, it appears to be an issue with the libinfer-dev_5.0.0-1+cuda10.0_arm64.deb package; we will fix it in the next update.

You can use the AlexNet model that jetson-inference downloads, or get it from here: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
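
For example, the deploy prototxt can be fetched straight from that repo path (raw GitHub URL assumed); the pretrained weights are linked from the readme in the same directory:

# fetch the BVLC AlexNet deploy prototxt
wget https://raw.githubusercontent.com/BVLC/caffe/master/models/bvlc_alexnet/deploy.prototxt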

If you don’t want to enable GPU fallback, you should remove the ‘prob’ layer from the prototxt; a modified prototxt is downloaded by the jetson-inference dev branch (direct link to modified prototxt here).
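
For reference, in the standard BVLC deploy prototxt the layer to delete should be the final Softmax stanza, which looks roughly like this:

layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc8"
  top: "prob"
}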

Does this look correct? These numbers are with the default clock/power settings.
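
(If anyone wants to repeat this at fixed clocks, something like the following should work, assuming the stock L4T power tools on Xavier:

# query the current power mode
sudo nvpmodel -q
# switch to max performance and pin the clocks before benchmarking
sudo nvpmodel -m 0
sudo jetson_clocks

I left everything at the defaults for the numbers below.)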

With DLA
nvidia@jetson-0423018055236:~/tensorrt/bin$ ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLA=1
deploy: /home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLA: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 19.6518 ms (host walltime is 20.1839 ms, 99% percentile time is 24.7552).
Average over 10 runs is 19.1648 ms (host walltime is 19.8968 ms, 99% percentile time is 20.14).
Average over 10 runs is 19.7278 ms (host walltime is 20.2118 ms, 99% percentile time is 20.9213).
Average over 10 runs is 19.432 ms (host walltime is 20.2801 ms, 99% percentile time is 19.8216).
Average over 10 runs is 19.4548 ms (host walltime is 20.0524 ms, 99% percentile time is 19.9742).
Average over 10 runs is 19.5619 ms (host walltime is 20.1502 ms, 99% percentile time is 20.8323).
Average over 10 runs is 19.3835 ms (host walltime is 20.5315 ms, 99% percentile time is 19.5666).
Average over 10 runs is 14.5115 ms (host walltime is 14.9602 ms, 99% percentile time is 20.2947).
Average over 10 runs is 17.2961 ms (host walltime is 17.5563 ms, 99% percentile time is 19.1724).
Average over 10 runs is 19.5955 ms (host walltime is 19.9446 ms, 99% percentile time is 20.8753).

Without DLA
nvidia@jetson-0423018055236:~/tensorrt/bin$ ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet.prototxt --output=prob --fp16 --useDLA=0
deploy: /home/nvidia/tensorrt/data/AlexNet/alexnet.prototxt
output: prob
fp16
useDLA: 0
Input "data": 3x227x227
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 11.621 ms (host walltime is 11.9936 ms, 99% percentile time is 15.8403).
Average over 10 runs is 6.79573 ms (host walltime is 6.99659 ms, 99% percentile time is 9.82736).
Average over 10 runs is 5.90683 ms (host walltime is 6.06322 ms, 99% percentile time is 9.10438).
Average over 10 runs is 5.90028 ms (host walltime is 5.99019 ms, 99% percentile time is 8.44902).
Average over 10 runs is 5.88509 ms (host walltime is 6.03842 ms, 99% percentile time is 8.50826).
Average over 10 runs is 5.79798 ms (host walltime is 5.93169 ms, 99% percentile time is 8.15715).
Average over 10 runs is 5.82621 ms (host walltime is 6.01957 ms, 99% percentile time is 8.75309).
Average over 10 runs is 5.89837 ms (host walltime is 5.99771 ms, 99% percentile time is 7.99946).
Average over 10 runs is 5.92456 ms (host walltime is 6.01304 ms, 99% percentile time is 9.00698).
Average over 10 runs is 5.6939 ms (host walltime is 5.89721 ms, 99% percentile time is 6.91085).

Hi S4WRXTTCS, I get similar times running in MODE_15W. And with GoogleNet_noprob.prototxt I get ~75 FPS on DLA.

When I run ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLA=1
it fails with this message: “Parameter check failed at: …/builder/builder.cpp::setDefaultDeviceType::226, condition: mHwContext.hasDLA && static_cast(deviceType) <= mHwContext.nbDLAEngines”

Does this look correct?

Hi mingxian32, can you try running trtexec with sudo privileges?

If you continue experiencing the issue, you may want to re-flash with JetPack and re-install L4T / TensorRT, as it appears there is some issue detecting DLA on your device.
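
Before re-flashing, it may be worth confirming what is actually installed, e.g.:

# list the installed TensorRT / inference packages and their versions
dpkg -l | grep -i -e tensorrt -e nvinfer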

Hi,

It is weird! I don’t get the same times as you; there are big differences. And I also run in 15W mode.

FYI: the useDLA flag was replaced by useDLACore.

Thanks

With DLA

$ bin/trtexec --deploy=data/alexnet/alexnet_noprob.prototxt --output=fc8 --useDLACore=1 --fp16

deploy: data/alexnet/alexnet_noprob.prototxt
output: fc8
useDLACore: 1
fp16
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 10.2751 ms (host walltime is 11.1669 ms, 99% percentile time is 16.472).
Average over 10 runs is 9.39058 ms (host walltime is 10.2195 ms, 99% percentile time is 9.66144).
Average over 10 runs is 9.77683 ms (host walltime is 10.5191 ms, 99% percentile time is 10.6946).
Average over 10 runs is 10.605 ms (host walltime is 11.045 ms, 99% percentile time is 12.1528).
Average over 10 runs is 10.2545 ms (host walltime is 10.6109 ms, 99% percentile time is 10.7653).
Average over 10 runs is 10.1848 ms (host walltime is 10.6146 ms, 99% percentile time is 11.2446).
Average over 10 runs is 10.1359 ms (host walltime is 10.7114 ms, 99% percentile time is 10.5277).
Average over 10 runs is 9.88428 ms (host walltime is 10.3071 ms, 99% percentile time is 10.3476).
Average over 10 runs is 10.2372 ms (host walltime is 10.5192 ms, 99% percentile time is 11.1892).
Average over 10 runs is 10.3196 ms (host walltime is 10.6513 ms, 99% percentile time is 10.7878).


Without DLA

$ bin/trtexec --deploy=data/alexnet/alexnet.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback

deploy: data/alexnet/alexnet.prototxt
output: prob
useDLACore: 0
fp16
allowGPUFallback
Input "data": 3x227x227
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 9.63429 ms (host walltime is 10.5536 ms, 99% percentile time is 11.1104).
Average over 10 runs is 9.78064 ms (host walltime is 10.4858 ms, 99% percentile time is 11.6869).
Average over 10 runs is 9.78679 ms (host walltime is 10.7028 ms, 99% percentile time is 10.5165).
Average over 10 runs is 9.56096 ms (host walltime is 10.2707 ms, 99% percentile time is 9.94714).
Average over 10 runs is 9.59396 ms (host walltime is 10.5046 ms, 99% percentile time is 10.2052).
Average over 10 runs is 9.85251 ms (host walltime is 10.7424 ms, 99% percentile time is 10.8503).
Average over 10 runs is 9.5877 ms (host walltime is 10.4543 ms, 99% percentile time is 9.82426).
Average over 10 runs is 9.70454 ms (host walltime is 10.5341 ms, 99% percentile time is 10.0946).
Average over 10 runs is 9.64803 ms (host walltime is 10.5432 ms, 99% percentile time is 10.5482).
Average over 10 runs is 10.0365 ms (host walltime is 10.4798 ms, 99% percentile time is 10.2881).

Hello,

I also want to share my results with DLA and without DLA.

In my case, I configured the board in max performance mode (30W).

My results were also weird, because the two execution times are similar regardless of whether DLA is used.

With DLA
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=1

deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.7926 ms (host walltime is 6.98827 ms, 99% percentile time is 11.3653).
Average over 10 runs is 6.27773 ms (host walltime is 6.415 ms, 99% percentile time is 6.33238).
Average over 10 runs is 6.34808 ms (host walltime is 6.51361 ms, 99% percentile time is 6.4911).
Average over 10 runs is 6.44476 ms (host walltime is 6.61664 ms, 99% percentile time is 6.55571).
Average over 10 runs is 6.46809 ms (host walltime is 6.63325 ms, 99% percentile time is 6.5311).
Average over 10 runs is 6.39661 ms (host walltime is 6.61909 ms, 99% percentile time is 6.51162).
Average over 10 runs is 6.3873 ms (host walltime is 6.92348 ms, 99% percentile time is 6.56896).
Average over 10 runs is 6.3655 ms (host walltime is 6.60092 ms, 99% percentile time is 6.47475).
Average over 10 runs is 6.42326 ms (host walltime is 6.57476 ms, 99% percentile time is 6.58432).
Average over 10 runs is 6.4514 ms (host walltime is 6.60459 ms, 99% percentile time is 6.53622).

Without DLA
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=0

deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 0
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.37521 ms (host walltime is 6.55121 ms, 99% percentile time is 7.07267).
Average over 10 runs is 6.43402 ms (host walltime is 6.61367 ms, 99% percentile time is 6.53213).
Average over 10 runs is 6.59314 ms (host walltime is 6.7674 ms, 99% percentile time is 6.8311).
Average over 10 runs is 6.46851 ms (host walltime is 6.63689 ms, 99% percentile time is 6.53619).
Average over 10 runs is 6.49032 ms (host walltime is 6.67388 ms, 99% percentile time is 6.56176).
Average over 10 runs is 6.50446 ms (host walltime is 6.66483 ms, 99% percentile time is 6.76147).
Average over 10 runs is 6.47384 ms (host walltime is 6.66675 ms, 99% percentile time is 6.61805).
Average over 10 runs is 6.45161 ms (host walltime is 6.63162 ms, 99% percentile time is 6.62525).
Average over 10 runs is 6.59969 ms (host walltime is 6.9295 ms, 99% percentile time is 8.73366).
Average over 10 runs is 6.48038 ms (host walltime is 6.63838 ms, 99% percentile time is 6.52291).
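
(A possible explanation, if --useDLACore=N selects DLA core N rather than toggling DLA on and off: both runs above may have executed on a DLA core, which would explain the near-identical times. A GPU-only baseline would then omit the flag entirely, e.g.:

# hypothetical GPU-only baseline: no --useDLACore flag at all
sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16
)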

Hello, also sharing my results with VGG16 (from Keras applications; the last layer is a flatten).

with DLA (FP16): ~16.5 ms
without DLA (FP16): ~5.0 ms

I assume these results are because almost half of the network is still running on the GPU.

******************************
Layers running on DLA:
block1_conv1/convolution, block1_conv1/BiasAdd, block1_conv1/Relu, block1_conv2/convolution, block1_conv2/BiasAdd, block1_conv2/Relu, block1_pool/MaxPool, block2_conv1/convolution, block2_conv1/BiasAdd, block2_conv1/Relu, block2_conv2/convolution, block2_conv2/BiasAdd, block2_conv2/Relu, block2_pool/MaxPool, block3_conv1/convolution, block3_conv1/BiasAdd, block3_conv1/Relu, block3_conv2/convolution, block3_conv2/BiasAdd, block3_conv2/Relu, block3_conv3/convolution, block3_conv3/BiasAdd, block3_conv3/Relu, block3_pool/MaxPool, block4_conv1/convolution, block4_conv1/BiasAdd, block4_conv1/Relu, block4_conv2/convolution, block4_conv2/BiasAdd, block4_conv2/Relu, block4_conv3/convolution, block4_conv3/BiasAdd, block4_conv3/Relu, block4_pool/MaxPool, block5_conv1/convolution, block5_conv1/BiasAdd, block5_conv1/Relu, block5_conv2/convolution, block5_conv2/BiasAdd, block5_conv2/Relu, block5_conv3/convolution, block5_conv3/BiasAdd, block5_conv3/Relu, block5_pool/MaxPool
******************************

******************************
Layers running on GPU:
block1_conv1/kernel, block1_conv1/bias, block1_conv2/kernel, block1_conv2/bias, block2_conv1/kernel, block2_conv1/bias, block2_conv2/kernel, block2_conv2/bias, block3_conv1/kernel, block3_conv1/bias, block3_conv2/kernel, block3_conv2/bias, block3_conv3/kernel, block3_conv3/bias, block4_conv1/kernel, block4_conv1/bias, block4_conv2/kernel, block4_conv2/bias, block4_conv3/kernel, block4_conv3/bias, block5_conv1/kernel, block5_conv1/bias, block5_conv2/kernel, block5_conv2/bias, block5_conv3/kernel, block5_conv3/bias, reshape_1/strided_slice/stack, reshape_1/strided_slice/stack_1, reshape_1/Reshape/shape/1, (Unnamed Layer* 73) [Shuffle], reshape_1/Reshape
******************************