Unable to verify Xavier inference benchmarks

I’m trying to verify classification benchmarks on the Xavier, but I’m unable to replicate the performance numbers that I see posted online. I was largely able to follow the tf_to_trt_image_classification instructions, but I had to make the changes described in the thread below in order to get the UFF converter to compile:

https://devtalk.nvidia.com/default/topic/1043619/jetson-tf_to_trt_image_classification/?offset=6

After compilation, I ran the following scripts, without modification:

source scripts/download_models.sh
python3 scripts/models_to_frozen_graphs.py
source scripts/download_images.sh
python3 scripts/frozen_graphs_to_plans.py
python3 scripts/test_trt.py
python3 scripts/test_tf.py

Below are the benchmark timings for TensorFlow (data/test_output_tf.txt):

vgg_16 4218.965682983398
inception_v1 15.160059928894043
inception_v2 16.9545841217041
inception_v3 31.412348747253418
inception_v4 57.95839786529541
inception_resnet_v2 70.55327415466309
resnet_v1_50 25.6368350982666
resnet_v1_101 45.75383186340332
resnet_v1_152 60.83596229553223
resnet_v2_50 33.184447288513184
resnet_v2_101 64.34244155883789
resnet_v2_152 84.0024471282959
mobilenet_v1_1p0_224 12.122135162353516
mobilenet_v1_0p5_160 6.786251068115234
mobilenet_v1_0p25_128 7.124357223510742

And for TensorRT (data/test_output_trt.txt):

data/plans/vgg_16.plan 12.1812
data/plans/inception_v1.plan 5.35698
data/plans/inception_v3.plan 22.4136
data/plans/inception_v4.plan 21.4755
data/plans/inception_resnet_v2.plan 23.2827
data/plans/resnet_v2_50.plan 8.40148
data/plans/resnet_v2_101.plan 16.299
data/plans/resnet_v2_152.plan 20.1305
data/plans/mobilenet_v1_1p0_224.plan 6.92651
data/plans/mobilenet_v1_0p5_160.plan 2.98501
data/plans/mobilenet_v1_0p25_128.plan 3.13828

The times are only about 1x to 3x faster than those reported for the TX2 in the GitHub link, which does not seem consistent with the published Xavier numbers. It also seems that some of the models failed to convert (the resnet_v1 variants and inception_v2 are missing from the TensorRT results).

I also tried to follow the instructions posted in the link above, which explicitly call trtexec. I wasn’t able to find the resnet50.prototxt file mentioned in the link, but there was a googlenet.prototxt provided at /usr/src/tensorrt/data/googlenet/googlenet.prototxt, so I tried that:

int8 on GPU

./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --int8 --batch=8 --iterations=10000 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
int8
batch: 8
iterations: 10000
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 20x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 10.507 ms (host walltime is 10.5863 ms, 99% percentile time is 40.4644).
Average over 100 runs is 6.29326 ms (host walltime is 6.35656 ms, 99% percentile time is 8.84531).
Average over 100 runs is 6.23239 ms (host walltime is 6.2915 ms, 99% percentile time is 8.29283).

fp16 on GPU

./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useSpinWait
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 20x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 9.11288 ms (host walltime is 9.17181 ms, 99% percentile time is 41.3132).
Average over 100 runs is 8.18154 ms (host walltime is 8.23468 ms, 99% percentile time is 11.0188).
Average over 100 runs is 8.12368 ms (host walltime is 8.17516 ms, 99% percentile time is 11.0971).

fp16 on DLA core 0

./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=0 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useDLACore: 0
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 20x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 28.7429 ms (host walltime is 28.8939 ms, 99% percentile time is 31.6979).
Average over 100 runs is 28.6642 ms (host walltime is 28.8532 ms, 99% percentile time is 30.1394).
Average over 100 runs is 28.5911 ms (host walltime is 28.7823 ms, 99% percentile time is 29.482).

fp16 on DLA core 1

./trtexec --avgRuns=100 --deploy=../data/googlenet/googlenet.prototxt --fp16 --batch=8 --iterations=10000 --output=prob --useDLACore=1 --useSpinWait --allowGPUFallback
avgRuns: 100
deploy: ../data/googlenet/googlenet.prototxt
fp16
batch: 8
iterations: 10000
output: prob
useDLACore: 1
useSpinWait
allowGPUFallback
Input "data": 3x224x224
Output "prob": 20x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 100 runs is 28.6083 ms (host walltime is 28.7687 ms, 99% percentile time is 33.8545).
Average over 100 runs is 28.4075 ms (host walltime is 28.6114 ms, 99% percentile time is 29.0141).
Average over 100 runs is 28.5257 ms (host walltime is 28.6889 ms, 99% percentile time is 30.4743).

The DLA times are slower than the GPU times, even compared with FP16 on the GPU. According to the benchmarks provided above, I should be seeing approximately 4 ms inference time at batch size 8 on the DLA cores. Based on the output text, it looks like part of the inference is falling back to the GPU, but that should hopefully not incur a 3x performance penalty.

All benchmarks were taken on an AGX Xavier Devkit with JetPack 4.1.1 installed, running in MAX_N mode.

Is there another set of instructions or links that I should be following in order to make use of the DLAs? I want to ensure that I can replicate the provided benchmarks before trying to use the DLAs to perform object detection. I am following the above links due to the suggestions here:

https://devtalk.nvidia.com/default/topic/1047297/jetson-agx-xavier/dla-for-object-detection-supported-with-tf-trt-on-xavier-/

Hi Vasu, you can find the links to the other prototxts for trtexec in this post: https://devtalk.nvidia.com/default/topic/1046147/jetson-agx-xavier/instructions-and-models-to-duplicate-jetson-agx-xavier-deep-learning-inference-benchmarks/post/5308612/#5308612

The benchmark results report the aggregate performance of the GPU and the two DLAs running concurrently (GPU at INT8, DLAs at FP16), as seen by launching these commands simultaneously: Jetson Benchmarks | NVIDIA Developer

It is expected that DLA is slower than the GPU (but DLA is more power efficient). Also see this topic for reference regarding concurrent execution.
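
For anyone trying to reproduce the aggregate measurement, here is a minimal Python sketch that launches the GPU INT8 run and both DLA FP16 runs at the same time, using the same flags and prototxt path as the commands earlier in this thread. It assumes trtexec is invoked from its usual bin directory (as in the runs above); adjust paths for your install.

import subprocess

PROTOTXT = "../data/googlenet/googlenet.prototxt"   # same relative path used above
COMMON = ["--deploy=" + PROTOTXT, "--output=prob", "--batch=8",
          "--avgRuns=100", "--iterations=10000", "--useSpinWait"]

cmds = [
    ["./trtexec", "--int8"] + COMMON,                                          # GPU, INT8
    ["./trtexec", "--fp16", "--useDLACore=0", "--allowGPUFallback"] + COMMON,  # DLA 0, FP16
    ["./trtexec", "--fp16", "--useDLACore=1", "--allowGPUFallback"] + COMMON,  # DLA 1, FP16
]

# Start all three concurrently and wait for them to finish; the aggregate
# throughput is then the sum of the per-engine images/sec figures.
procs = [subprocess.Popen(cmd) for cmd in cmds]
for p in procs:
    p.wait()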

Got it. Thanks Dusty! Looks like I didn’t go back far enough when searching for related threads.

I was able to verify the benchmarks successfully using the links you posted. Thanks!

I did notice that the inference time on the DLA cores seems to be affected by the load on the GPU. With no GPU inference running at all, the DLA inference times were at their fastest, and they got somewhat (up to 25% or so) slower while GPU inference was running. Interestingly, the DLA inference also slowed down by different amounts depending on whether the GPU was running FP16 or INT8 inference.

Here are my benchmarks, if anyone wants them for comparison. I also tested ResNet-101 and ResNet-152.

I did notice that I would periodically get the following error on the DLA cores (the process running inference on the GPU would keep running). It seems to occur only with batch size 1. I found the thread below with the same error, but there was no discussion of what might be causing it (and my batch size is already as low as it can go). Any idea what could be causing it?

https://devtalk.nvidia.com/default/topic/1042518/jetson-agx-xavier/technical-reference-manual-availability-/post/5304231/#5304231

NVMEDIA_DLA : 1361, ERROR: Submit failed.
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
__NvRmMemMap:310 [12] mmap failed
NVMEDIA_DLA : 1361, ERROR: Submit failed.
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    29024 abort (core dumped)  ./trtexec --avgRuns=100  --fp16 --batch=1 --iterations=10000 --output=prob

EDIT: I just checked memory usage during inference. Running GoogLeNet at batch size 1 (on the DLAs and GPU), I’m sitting at about 13.2 GB of memory usage. Is it possible that the above error is caused by running out of memory?

What network are you running when you get the intermittent error? Do you ever see it if you run just a single DLA process alone, or if you try a different batch size? There have been stability improvements made to the DLA software in the upcoming JetPack release, which should remedy it.

I had the error appear with both AlexNet and GoogLeNet (from the links you provided in another thread), but not with any of the ResNet models. The errors appeared when running GPU and DLA concurrently, and appeared on both DLAs at approximately the same time (though not exactly). Both trtexec processes on the DLAs were started at just about the same time. I didn’t try running the DLA inference alone. The errors only appeared when using a batch size of 1 - I wasn’t able to recreate the error with a higher batch size. I’ll try a few more configurations and let you know if there are any more crashes.

Looking forward to the next (production?) release of JetPack!

Thanks Vasu. Yes, the next version of JetPack will be a production release for Xavier.

The link shows performance in images/second. How can I get those metrics? Do I need to download a dataset?

Hi keithdm, trtexec reports the processing time per image (in milliseconds). To convert that to images per second, take 1000 / time.

Since this is related to benchmarking, I didn’t want to open a new thread.

When I use trtexec with the ResNet-50 Caffe model I get 3 ms (as in the official benchmark), but if I use the UFF file produced by the tf_to_trt_image_classification scripts I get 8 ms.

The only difference between the models is that the TensorFlow model has fused layers.

Is TensorRT better at optimizing Caffe models?

Sorry Dustin, I’m still having trouble interpreting the results. I have results similar to the first post, using GoogLeNet at batch size 8:

JetPack 4.2

GPU INT8: Average over 100 runs is 6.23211 ms (host walltime is 6.33103 ms, 99% percentile time is 8.07334)
DLA 0 FP16: Average over 100 runs is 28.9974 ms (host walltime is 29.1836 ms, 99% percentile time is 30.2756)
DLA 1 FP16: Average over 100 runs is 29.0472 ms (host walltime is 29.2843 ms, 99% percentile time is 30.289)

So I’m calculating using the average latency numbers above:
Performance Formula = 1000/(latency*batch) = [images/sec]
Cumulative Performance = (GPU + DLA1 + DLA0) /3 = [images/sec]

GPU int8 = 1000/(6.232*10^-3 * 8) = 20057 images/sec
DLA 0 fp16 = 1000/(28.9974ms *8) = 4310 images/sec
DLA 1 fp16 = 1000/(29.0472ms *8) = 4303 images/sec

Cumulative Performance = (20057+4310+4303)/3 = 9556 images/sec

The latency on the website for GoogLeNet at batch size 8 says 7.9 ms and 1015 img/sec, so I’m wondering if I’m using the right time(s).

Hi Keith, your images/sec calculation is using an extra 10^-3, and the cumulative performance is the sum of all three, since they are running concurrently.

  • images/sec = (1000 / latency) * batch size
    GPU INT8   = 1000 / 6.23211 * 8 = 1283.67439 images/sec
    DLA_0 FP16 = 1000 / 28.9974 * 8 = 275.8868036 images/sec
    DLA_1 FP16 = 1000 / 29.0472 * 8 = 275.4138092 images/sec
    
  • cumulative images/sec = GPU + DLA_0 + DLA_1
    1283.67439 + 275.8868036 + 275.4138092 = 1834.975 images/sec
    
  • cumulative latency = (1000 / fps) * batch size
    1000 / 1834.975 * 8 = 4.359732 ms
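
For anyone scripting this, here is the same arithmetic as a small Python check, using the latencies and batch size quoted above:

batch = 8
latencies_ms = {            # mean per-batch latency reported by trtexec, in ms
    "GPU INT8":   6.23211,
    "DLA_0 FP16": 28.9974,
    "DLA_1 FP16": 29.0472,
}

fps = {name: 1000.0 / ms * batch for name, ms in latencies_ms.items()}
cumulative_fps = sum(fps.values())                  # engines run concurrently, so add them
cumulative_latency_ms = 1000.0 / cumulative_fps * batch

for name, value in fps.items():
    print(f"{name}: {value:.1f} images/sec")
print(f"cumulative: {cumulative_fps:.1f} images/sec, "
      f"effective latency {cumulative_latency_ms:.3f} ms at batch {batch}")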
    

Looks like you got better results with the performance improvements in JetPack 4.2 than the numbers we previously published for JetPack 4.1.1.

Hi Dusty,

Is it possible to translate these values into TOPs or TFLOPs? This would help me compare the inference capabilities of the Xavier against another NVIDIA GPU, e.g. a Quadro RTX.

Not directly. What you would typically do is look up a paper about the network in question and see if it gives an OPs/FLOPs count for the network. Then you multiply the benchmark’s images/sec by that number.

If you want an apples-to-apples comparison, it would be best just to run the network on both GPUs and compare the times. Otherwise, you can get a rough estimate by comparing the total TOPS count (e.g. 32 TOPS for AGX Xavier).
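
As a rough illustration of that estimate, here is a quick Python sketch using the cumulative GoogLeNet images/sec from earlier in the thread. The per-image op count below is an assumed placeholder, not an NVIDIA figure; substitute the value from a paper or profiler for your network.

# Back-of-the-envelope: achieved OPs/sec = images/sec * OPs per image.
images_per_sec = 1834.975      # cumulative GoogLeNet result from the post above
ops_per_image = 2.0e9          # ASSUMPTION: ~2 GOPs per image, illustrative only

achieved_tops = images_per_sec * ops_per_image / 1e12
print(f"~{achieved_tops:.2f} TOPS achieved")   # compare against ~32 TOPS peak for AGX Xavier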

Have you guys figured out why?

I also hit the same issue here:

__NvRmMemMap:310 [12] mmap failed
NVMEDIA_DLA : 1361, ERROR: Submit failed.
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
....
Caught exception while logging: [Pool exception]
Caught exception while logging: [Pool exception]
Caught exception while logging: [Pool exception]
Caught exception while logging: [Pool exception]

I am on TensorRT 5.0.3.2 with CUDA 10.0 on the Jetson Xavier. Any hints on why this would happen? @dusty_nv

I have DLA enabled and tried to reproduce this issue, but never got the exact same error. Since I can hardly reproduce the issue, I don’t think upgrading to a later version will help. I searched online but found nothing; this seems to be the only related post.

Initially I thought this might be related to memory, so I tried stressing memory both while the service was running inference and while it was spinning up, but I could not reproduce the error.

Any hints would be helpful. Thanks in advance.

Hi @Klein92,
I’m seeing the same errors. Were you able to find out what the problem was?

Thanks,
Eyal

Hi Dusty, I just want to confirm with NVIDIA which latency number reported by trtexec is used for the benchmarks.
Is it the Host Latency, the GPU Compute time, or something else?

Here is an example printout.

[04/05/2021-12:36:30] [I] Host Latency
[04/05/2021-12:36:30] [I] min: 6.76562 ms (end to end 6.84555 ms)
[04/05/2021-12:36:30] [I] max: 47.05 ms (end to end 47.1611 ms)
[04/05/2021-12:36:30] [I] mean: 8.56727 ms (end to end 8.60459 ms)
[04/05/2021-12:36:30] [I] median: 8.40625 ms (end to end 8.42188 ms)
[04/05/2021-12:36:30] [I] percentile: 11.5508 ms at 99% (end to end 11.6953 ms at 99%)
[04/05/2021-12:36:30] [I] throughput: 929.695 qps
[04/05/2021-12:36:30] [I] walltime: 86.0497 s
[04/05/2021-12:36:30] [I] Enqueue Time
[04/05/2021-12:36:30] [I] min: 0.420166 ms
[04/05/2021-12:36:30] [I] max: 13.7656 ms
[04/05/2021-12:36:30] [I] median: 0.738281 ms
[04/05/2021-12:36:30] [I] GPU Compute
[04/05/2021-12:36:30] [I] min: 6.60791 ms
[04/05/2021-12:36:30] [I] max: 46.8982 ms
[04/05/2021-12:36:30] [I] mean: 8.38616 ms
[04/05/2021-12:36:30] [I] median: 8.23438 ms
[04/05/2021-12:36:30] [I] percentile: 11.3398 ms at 99%
[04/05/2021-12:36:30] [I] total compute time: 83.8616 s
&&&& PASSED TensorRT.trtexec # ./trtexec --avgRuns=100 --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --int8 --batch=8 --iterations=10000 --output=prob
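
For reference while that question is open, here is a small Python sketch that pulls both candidate numbers (mean Host Latency and mean GPU Compute) out of a saved trtexec log like the one above, so they can be compared side by side. It only parses the text shown here and does not assume which figure NVIDIA uses for the published benchmarks.

import re

def mean_latencies(log_text):
    """Return the 'mean' value (ms) from the Host Latency and GPU Compute sections."""
    means = {}
    section = None
    for line in log_text.splitlines():
        # Track which section of the trtexec summary we are in.
        if "Host Latency" in line:
            section = "host_latency"
        elif "GPU Compute" in line:
            section = "gpu_compute"
        # The first "mean: X ms" line after a section header belongs to that section.
        m = re.search(r"mean:\s*([\d.]+)\s*ms", line)
        if m and section and section not in means:
            means[section] = float(m.group(1))
    return means

# Example usage: means = mean_latencies(open("trtexec.log").read())
# For the printout above this gives {'host_latency': 8.56727, 'gpu_compute': 8.38616}.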