TensorRT 5 docs and examples (Solved)

Is there any documentation available yet for TensorRT 5?

The JetPack-4.0-Developer-Preview doc refers to new Caffe SSD and YOLO samples, but they don’t appear to be in /usr/src/tensorrt/samples or on the Deep Learning SDK documentation page. I can’t find any release notes for TensorRT 5 either.

The JetPack-4.0 doc also mentions the ability to execute on either iGPU or DLA using TensorRT 5. Are there any examples available that demonstrate this?

The TensorRT docs are located in /usr/share/doc/tensorrt
The TensorRT samples are located at /usr/src/tensorrt

The UFF SSD sample is included, and for GPU/DLA execution, trtexec and several other samples have been updated with DLA support.
The dev branch of jetson-inference also contains DLA support.
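
For reference, the dev branch can be cloned directly (assuming the usual GitHub location of the repo):

# clone the dev branch (--recursive pulls in any submodules the build may need)
git clone --recursive -b dev https://github.com/dusty-nv/jetson-inference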

The release notes for TensorRT are included in the JetPack Release Notes.

Yes, I had already found the TensorRT folder at /usr/share/doc/tensorrt; however, I don’t see any actual documentation files. The html directory is empty except for another empty “search” directory. There’s a changelog.Debian.gz file, but it won’t extract with the usual tar command. What is supposed to be in that folder? Maybe that folder got corrupted somehow on my Xavier?
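
(Side note for anyone else hitting this: changelog.Debian.gz is a plain gzip file rather than a tarball, so it opens with gunzip/zcat instead of tar, e.g.:

# view the compressed changelog without extracting it
zcat /usr/share/doc/tensorrt/changelog.Debian.gz | less

That only shows the Debian changelog, though, not the missing developer guide.)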

I had also seen the TensorRT samples at /usr/src/tensorrt/samples without realizing the samples had been updated for TensorRT 5 and DLA support. Thanks for pointing that out, and for the jetson-inference support for DLA. I’ll have a look at those…

Hmm, on your host PC where you downloaded JetPack, go into the jetpack_download/ directory and extract the package libinfer-dev_5.0.0-1+cuda10.0_arm64.deb (it extracts from the GUI just like a zip archive would).

Then extract data.tar.xz, and the TensorRT Developer Guide will be there under /usr/share/doc/tensorrt.
Chapter 6 of the TensorRT Developer Guide covers DLA and will let you know what to look for in the source examples.
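
As a command-line alternative to the GUI extraction, dpkg-deb should handle both steps in one go (a sketch, run from jetpack_download/ on the host):

# extract the .deb contents (including data.tar.xz) into ./trt-doc
dpkg-deb -x libinfer-dev_5.0.0-1+cuda10.0_arm64.deb ./trt-doc
# the Developer Guide then lands under ./trt-doc/usr/share/doc/tensorrt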

Nice, I now have the TensorRT 5 Developer Guide. Between that and the jetson-inference dev repo, it ought to be enough to get me started with the DLA…

On page 37 of the developer guide there is an example of running an AlexNet model on the DLA, but that AlexNet model isn’t in the data directory.

Did I miss some step? It’s looking for AlexNet/AlexNet_N2.prototxt

Same problem here; after installation, /usr/share/doc/tensorrt is pretty much empty:

find /usr/share/doc/tensorrt

/usr/share/doc/tensorrt
/usr/share/doc/tensorrt/graphics
/usr/share/doc/tensorrt/changelog.Debian.gz
/usr/share/doc/tensorrt/common
/usr/share/doc/tensorrt/common/graphics
/usr/share/doc/tensorrt/common/scripts
/usr/share/doc/tensorrt/common/scripts/google-analytics
/usr/share/doc/tensorrt/common/scripts/tynt
/usr/share/doc/tensorrt/common/formatting
/usr/share/doc/tensorrt/copyright
/usr/share/doc/tensorrt/html
/usr/share/doc/tensorrt/html/search

Sorry about that, it appears to be an issue with the libinfer-dev_5.0.0-1+cuda10.0_arm64.deb package; we will fix it in the next update.

You can use the AlexNet model that jetson-inference downloads, or get it from here: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
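
For example, the deploy prototxt can be fetched straight from that repo path (raw GitHub URL assumed); the pretrained weights are linked from the readme in the same directory:

# fetch the BVLC AlexNet deploy prototxt
wget https://raw.githubusercontent.com/BVLC/caffe/master/models/bvlc_alexnet/deploy.prototxt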

If you don’t want to enable GPU fallback, you should remove the ‘prob’ layer from the prototxt; a modified prototxt is downloaded by the jetson-inference dev branch (direct link to modified prototxt here).
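
For reference, in the standard BVLC deploy prototxt the layer to delete should be the final Softmax stanza, which looks roughly like this:

layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc8"
  top: "prob"
}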

Does this look correct? These numbers are with the default clock/power settings.
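
(If anyone wants to repeat this at fixed clocks, something like the following should work, assuming the stock L4T power tools on Xavier:

# query the current power mode
sudo nvpmodel -q
# switch to max performance and pin the clocks before benchmarking
sudo nvpmodel -m 0
sudo jetson_clocks

I left everything at the defaults for the numbers below.)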

With DLA
nvidia@jetson-0423018055236:~/tensorrt/bin$ ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLA=1
deploy: /home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLA: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 19.6518 ms (host walltime is 20.1839 ms, 99% percentile time is 24.7552).
Average over 10 runs is 19.1648 ms (host walltime is 19.8968 ms, 99% percentile time is 20.14).
Average over 10 runs is 19.7278 ms (host walltime is 20.2118 ms, 99% percentile time is 20.9213).
Average over 10 runs is 19.432 ms (host walltime is 20.2801 ms, 99% percentile time is 19.8216).
Average over 10 runs is 19.4548 ms (host walltime is 20.0524 ms, 99% percentile time is 19.9742).
Average over 10 runs is 19.5619 ms (host walltime is 20.1502 ms, 99% percentile time is 20.8323).
Average over 10 runs is 19.3835 ms (host walltime is 20.5315 ms, 99% percentile time is 19.5666).
Average over 10 runs is 14.5115 ms (host walltime is 14.9602 ms, 99% percentile time is 20.2947).
Average over 10 runs is 17.2961 ms (host walltime is 17.5563 ms, 99% percentile time is 19.1724).
Average over 10 runs is 19.5955 ms (host walltime is 19.9446 ms, 99% percentile time is 20.8753).

Without DLA
nvidia@jetson-0423018055236:~/tensorrt/bin$ ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet.prototxt --output=prob --fp16 --useDLA=0
deploy: /home/nvidia/tensorrt/data/AlexNet/alexnet.prototxt
output: prob
fp16
useDLA: 0
Input "data": 3x227x227
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 11.621 ms (host walltime is 11.9936 ms, 99% percentile time is 15.8403).
Average over 10 runs is 6.79573 ms (host walltime is 6.99659 ms, 99% percentile time is 9.82736).
Average over 10 runs is 5.90683 ms (host walltime is 6.06322 ms, 99% percentile time is 9.10438).
Average over 10 runs is 5.90028 ms (host walltime is 5.99019 ms, 99% percentile time is 8.44902).
Average over 10 runs is 5.88509 ms (host walltime is 6.03842 ms, 99% percentile time is 8.50826).
Average over 10 runs is 5.79798 ms (host walltime is 5.93169 ms, 99% percentile time is 8.15715).
Average over 10 runs is 5.82621 ms (host walltime is 6.01957 ms, 99% percentile time is 8.75309).
Average over 10 runs is 5.89837 ms (host walltime is 5.99771 ms, 99% percentile time is 7.99946).
Average over 10 runs is 5.92456 ms (host walltime is 6.01304 ms, 99% percentile time is 9.00698).
Average over 10 runs is 5.6939 ms (host walltime is 5.89721 ms, 99% percentile time is 6.91085).

Hi S4WRXTTCS, I get similar times running in MODE_15W. And with GoogleNet_noprob.prototxt I get ~75 FPS on DLA.

When I run ./trtexec --deploy=/home/nvidia/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLA=1
it fails with this message: “Parameter check failed at: …/builder/builder.cpp::setDefaultDeviceType::226, condition: mHwContext.hasDLA && static_cast(deviceType) <= mHwContext.nbDLAEngines”

Does this look correct?

Hi mingxian32, can you try running trtexec with sudo privileges?

If you continue experiencing the issue, you may want to re-flash with JetPack and re-install L4T / TensorRT, as it appears there is some issue detecting DLA on your device.
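
Before re-flashing, it may be worth confirming what is actually installed, e.g.:

# list the installed TensorRT / inference packages and their versions
dpkg -l | grep -i -e tensorrt -e nvinfer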

Hi,

It is weird! I don’t get the same times as you; there are big differences. And I also run in 15W mode.

FYI: the useDLA flag was replaced by useDLACore.

Thanks

With DLA

$ bin/trtexec --deploy=data/alexnet/alexnet_noprob.prototxt --output=fc8 --useDLACore=1 --fp16

deploy: data/alexnet/alexnet_noprob.prototxt
output: fc8
useDLACore: 1
fp16
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 10.2751 ms (host walltime is 11.1669 ms, 99% percentile time is 16.472).
Average over 10 runs is 9.39058 ms (host walltime is 10.2195 ms, 99% percentile time is 9.66144).
Average over 10 runs is 9.77683 ms (host walltime is 10.5191 ms, 99% percentile time is 10.6946).
Average over 10 runs is 10.605 ms (host walltime is 11.045 ms, 99% percentile time is 12.1528).
Average over 10 runs is 10.2545 ms (host walltime is 10.6109 ms, 99% percentile time is 10.7653).
Average over 10 runs is 10.1848 ms (host walltime is 10.6146 ms, 99% percentile time is 11.2446).
Average over 10 runs is 10.1359 ms (host walltime is 10.7114 ms, 99% percentile time is 10.5277).
Average over 10 runs is 9.88428 ms (host walltime is 10.3071 ms, 99% percentile time is 10.3476).
Average over 10 runs is 10.2372 ms (host walltime is 10.5192 ms, 99% percentile time is 11.1892).
Average over 10 runs is 10.3196 ms (host walltime is 10.6513 ms, 99% percentile time is 10.7878).


Without DLA

$ bin/trtexec --deploy=data/alexnet/alexnet.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback

deploy: data/alexnet/alexnet.prototxt
output: prob
useDLACore: 0
fp16
allowGPUFallback
Input "data": 3x227x227
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 9.63429 ms (host walltime is 10.5536 ms, 99% percentile time is 11.1104).
Average over 10 runs is 9.78064 ms (host walltime is 10.4858 ms, 99% percentile time is 11.6869).
Average over 10 runs is 9.78679 ms (host walltime is 10.7028 ms, 99% percentile time is 10.5165).
Average over 10 runs is 9.56096 ms (host walltime is 10.2707 ms, 99% percentile time is 9.94714).
Average over 10 runs is 9.59396 ms (host walltime is 10.5046 ms, 99% percentile time is 10.2052).
Average over 10 runs is 9.85251 ms (host walltime is 10.7424 ms, 99% percentile time is 10.8503).
Average over 10 runs is 9.5877 ms (host walltime is 10.4543 ms, 99% percentile time is 9.82426).
Average over 10 runs is 9.70454 ms (host walltime is 10.5341 ms, 99% percentile time is 10.0946).
Average over 10 runs is 9.64803 ms (host walltime is 10.5432 ms, 99% percentile time is 10.5482).
Average over 10 runs is 10.0365 ms (host walltime is 10.4798 ms, 99% percentile time is 10.2881).

Hello,

I also want to share my results with DLA and without DLA.

In my case, I configured the board in max performance mode (30W).

My results were also weird, because the two execution times are similar regardless of whether DLA is used.

With DLA
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=1

deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 1
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.7926 ms (host walltime is 6.98827 ms, 99% percentile time is 11.3653).
Average over 10 runs is 6.27773 ms (host walltime is 6.415 ms, 99% percentile time is 6.33238).
Average over 10 runs is 6.34808 ms (host walltime is 6.51361 ms, 99% percentile time is 6.4911).
Average over 10 runs is 6.44476 ms (host walltime is 6.61664 ms, 99% percentile time is 6.55571).
Average over 10 runs is 6.46809 ms (host walltime is 6.63325 ms, 99% percentile time is 6.5311).
Average over 10 runs is 6.39661 ms (host walltime is 6.61909 ms, 99% percentile time is 6.51162).
Average over 10 runs is 6.3873 ms (host walltime is 6.92348 ms, 99% percentile time is 6.56896).
Average over 10 runs is 6.3655 ms (host walltime is 6.60092 ms, 99% percentile time is 6.47475).
Average over 10 runs is 6.42326 ms (host walltime is 6.57476 ms, 99% percentile time is 6.58432).
Average over 10 runs is 6.4514 ms (host walltime is 6.60459 ms, 99% percentile time is 6.53622).

Without DLA
nvidia@jetson-0423718017159:/usr/src/tensorrt/bin$ sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16 --useDLACore=0

deploy: /usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt
output: fc8
fp16
useDLACore: 0
Input "data": 3x227x227
Output "fc8": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=fc8, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.37521 ms (host walltime is 6.55121 ms, 99% percentile time is 7.07267).
Average over 10 runs is 6.43402 ms (host walltime is 6.61367 ms, 99% percentile time is 6.53213).
Average over 10 runs is 6.59314 ms (host walltime is 6.7674 ms, 99% percentile time is 6.8311).
Average over 10 runs is 6.46851 ms (host walltime is 6.63689 ms, 99% percentile time is 6.53619).
Average over 10 runs is 6.49032 ms (host walltime is 6.67388 ms, 99% percentile time is 6.56176).
Average over 10 runs is 6.50446 ms (host walltime is 6.66483 ms, 99% percentile time is 6.76147).
Average over 10 runs is 6.47384 ms (host walltime is 6.66675 ms, 99% percentile time is 6.61805).
Average over 10 runs is 6.45161 ms (host walltime is 6.63162 ms, 99% percentile time is 6.62525).
Average over 10 runs is 6.59969 ms (host walltime is 6.9295 ms, 99% percentile time is 8.73366).
Average over 10 runs is 6.48038 ms (host walltime is 6.63838 ms, 99% percentile time is 6.52291).
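
(A possible explanation, if --useDLACore=N selects DLA core N rather than toggling DLA on and off: both runs above may have executed on a DLA core, which would explain the near-identical times. A GPU-only baseline would then omit the flag entirely, e.g.:

# hypothetical GPU-only baseline: no --useDLACore flag at all
sudo ./trtexec --deploy=/usr/src/tensorrt/data/AlexNet/alexnet_noprob.prototxt --output=fc8 --fp16
)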

Hello, also sharing my results with VGG16 (from Keras applications; the last layer is a flatten).

with DLA (FP16): ~16.5 ms
without DLA (FP16): ~5.0 ms

I assume these results are because almost half of the network is still running on the GPU.

******************************
Layers running on DLA:
block1_conv1/convolution, block1_conv1/BiasAdd, block1_conv1/Relu, block1_conv2/convolution, block1_conv2/BiasAdd, block1_conv2/Relu, block1_pool/MaxPool, block2_conv1/convolution, block2_conv1/BiasAdd, block2_conv1/Relu, block2_conv2/convolution, block2_conv2/BiasAdd, block2_conv2/Relu, block2_pool/MaxPool, block3_conv1/convolution, block3_conv1/BiasAdd, block3_conv1/Relu, block3_conv2/convolution, block3_conv2/BiasAdd, block3_conv2/Relu, block3_conv3/convolution, block3_conv3/BiasAdd, block3_conv3/Relu, block3_pool/MaxPool, block4_conv1/convolution, block4_conv1/BiasAdd, block4_conv1/Relu, block4_conv2/convolution, block4_conv2/BiasAdd, block4_conv2/Relu, block4_conv3/convolution, block4_conv3/BiasAdd, block4_conv3/Relu, block4_pool/MaxPool, block5_conv1/convolution, block5_conv1/BiasAdd, block5_conv1/Relu, block5_conv2/convolution, block5_conv2/BiasAdd, block5_conv2/Relu, block5_conv3/convolution, block5_conv3/BiasAdd, block5_conv3/Relu, block5_pool/MaxPool
******************************

******************************
Layers running on GPU:
block1_conv1/kernel, block1_conv1/bias, block1_conv2/kernel, block1_conv2/bias, block2_conv1/kernel, block2_conv1/bias, block2_conv2/kernel, block2_conv2/bias, block3_conv1/kernel, block3_conv1/bias, block3_conv2/kernel, block3_conv2/bias, block3_conv3/kernel, block3_conv3/bias, block4_conv1/kernel, block4_conv1/bias, block4_conv2/kernel, block4_conv2/bias, block4_conv3/kernel, block4_conv3/bias, block5_conv1/kernel, block5_conv1/bias, block5_conv2/kernel, block5_conv2/bias, block5_conv3/kernel, block5_conv3/bias, reshape_1/strided_slice/stack, reshape_1/strided_slice/stack_1, reshape_1/Reshape/shape/1, (Unnamed Layer* 73) [Shuffle], reshape_1/Reshape
******************************