Low FPS after successfully deploying a TLT model to DeepStream

Hello, I have successfully deployed my TLT model to DeepStream, but when I run it I get very low FPS, around 4-5 fps. I used ResNet18 as the pretrained model, and the resolution of my dataset is about 1280 x 720. Do you have any solution? Is it because my dataset's resolution is too large?

Hi m.billson16,
Which platform did you run on, Nano or Xavier? Which detection network did you train, detectnet_v2, SSD, or Faster R-CNN?
Also, could you paste the running command line along with the logs?
Thanks a lot.

I run my program on a Jetson Nano using detectnet_v2 (fp16 mode). I'm streaming from a USB camera to detect the objects that I used in my dataset.

Here is my running command:

deepstream-app -c /home/deepstream/Desktop/TA/source1_usb_dec_infer_resnet_fp16.txt

My config_infer_primary.txt

[property]
gpu-id=0
# preprocessing parameters.
net-scale-factor=0.0039215697906911373
model-color-format=0

# model paths.
labelfile-path=/home/deepstream/Desktop/TA/labels.txt
tlt-encoded-model=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.etlt
tlt-model-key=dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1
input-dims=3;720;1280;0 # where c = number of channels, h = height of the model input, w = width of model input, 0: implies CHW format.
uff-input-blob-name=input_1
batch-size=4 
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=3
interval=0
gie-unique-id=1
is-classifier=0
output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
#enable_dbscan=0

[class-attrs-all]
threshold=0.2
group-threshold=1
## Set eps=0.7 and minBoxes for enable-dbscan=1
eps=0.2
#minBoxes=3
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

My deepstream-app config (source1_usb_dec_infer_resnet_fp16.txt):

# Copyright (c) 2018 NVIDIA Corporation.  All rights reserved.
#
# NVIDIA Corporation and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto.  Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA Corporation is strictly prohibited.

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
#gie-kitti-output-dir=streamscl

[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=1
camera-width=1280
camera-height=720
camera-fps-n=30
camera-fps-d=1
camera-v4l2-dev-node=0

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming 5=Overlay
type=5
sync=0
display-id=0
offset-x=0
offset-y=0
width=0
height=0
overlay-id=1
source-id=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265 3=mpeg4
codec=1
sync=0
bitrate=2000000
output-file=out.mp4
source-id=0

[sink2]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File 4=RTSPStreaming 5=Overlay
type=4
#1=h264 2=h265
codec=1
sync=0
bitrate=4000000
# set below properties in case of RTSPStreaming
rtsp-port=8554
udp-port=5400


[osd]
enable=1
border-width=2
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0

[streammux]
##Boolean property to inform muxer that sources are live
live-source=1
batch-size=1
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=480
height=272

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
model-engine-file=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine
#Required to display the PGIE labels, should be added even when using config-file
#property
batch-size=1
#Required by the app for OSD, not a plugin property
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=0
#Required by the app for SGIE, when used along with config-file property
gie-unique-id=1
config-file=config_infer_primary.txt

[tests]
file-loop=0

Hi m.billson16,
Would you please check or try the items below? Thanks.
  1. Could you please run the following and paste the result:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

  2. Did you ever run “tlt-prune” and then re-train to get a pruned tlt model? What’s the prune-ratio? You can find it in the “tlt-prune” log.
  3. If yes, what’s the size of the pruned tlt model, your etlt model, and your resnet18_detector_fp16.engine?
  4. What’s the “-b” setting when you run “tlt-converter”? Can you paste the command line you ran?
  5. If you have already generated the TRT engine, please consider replacing

tlt-encoded-model=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.etlt
tlt-model-key=dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1

with

model-engine-file=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine

  6. What’s the fps result when you run deepstream-app against a local file instead of the stream from the USB camera? See the sketch below for one way to set that up.
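
For item 6, here is a minimal sketch of how the [source0] group could point at a local file for that test (the uri path is just a placeholder; the rest of your config can stay the same):

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=3
uri=file:///home/deepstream/Desktop/TA/test_video.mp4
num-sources=1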

Hello Morganh, thank you for your help.

For point number 1, here is the result:

[I] loadEngine: /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine
[I] fp16
[I] batch: 1
[I] iterations: 20
[I] output: output_cov/Sigmoid,output_bbox/BiasAdd
[I] useSpinWait
[E] [TRT] The engine plan file is not compatible with this version of TensorRT, expecting library version 5.1.6 got 5.1.5, please rebuild.
[I] /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine has been successfully loaded.
[E] Engine could not be created
&&&& FAILED TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait
  2. I had run tlt-prune to get the pruned model, and the prune ratio is 1 because the threshold is only about 5.2e-6.

  3. My model size is about 43 MB.

  4. My -b is 10. Command line:

tlt-converter $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt \
   -k $KEY \
   -o output_cov/Sigmoid,output_bbox/BiasAdd \
   -d 3,720,1280 \
   -m 16 \
   -t fp16\
   -e $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.engine \
   -w 50000000 \
   -b 10
  5. I tried, but I ended up with an error.

  6. Around 30 fps.

Do you have any idea?

Hi m.billson16,
Where did you generate resnet18_detector_fp16.engine? On the Nano?
If not, please download the Jetson platform version of tlt-converter (https://developer.nvidia.com/tlt-converter) and run it on the Nano in order to generate the TRT engine.

Then check whether items 1 and 5 are unblocked.

Hello Morganh, I tried to use tlt-converter on the Jetson platform, but my Jetson Nano screen freezes for a long time, and I think the process of converting from etlt to engine still has not finished. Do you have any solution?

Hello Morganh, I solved this problem.

When I tried to run

/usr/src/tensorrt/bin/trtexec --loadEngine=/home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

the result is like this:

&&&& RUNNING TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait
[I] loadEngine: /home/deepstream/Desktop/TA/resnet18_detector_fp16.engine
[I] fp16
[I] batch: 1
[I] iterations: 20
[I] output: output_cov/Sigmoid,output_bbox/BiasAdd
[I] useSpinWait
[I] /home/deepstream/Desktop/TA/resnet18_detector_fp16.engine has been successfully loaded.
[I] Average over 10 runs is 230.336 ms (host walltime is 230.424 ms, 99% percentile time is 327.723).
[I] Average over 10 runs is 219.831 ms (host walltime is 219.874 ms, 99% percentile time is 221.005).
[I] Average over 10 runs is 219.561 ms (host walltime is 219.603 ms, 99% percentile time is 220.26).
[I] Average over 10 runs is 220.107 ms (host walltime is 220.15 ms, 99% percentile time is 224.193).
[I] Average over 10 runs is 219.986 ms (host walltime is 220.029 ms, 99% percentile time is 223.398).
[I] Average over 10 runs is 220.043 ms (host walltime is 220.084 ms, 99% percentile time is 223.812).
[I] Average over 10 runs is 220.165 ms (host walltime is 220.206 ms, 99% percentile time is 223.842).
[I] Average over 10 runs is 220.15 ms (host walltime is 220.196 ms, 99% percentile time is 223.773).
[I] Average over 10 runs is 219.909 ms (host walltime is 219.951 ms, 99% percentile time is 222.098).
[I] Average over 10 runs is 224.225 ms (host walltime is 224.27 ms, 99% percentile time is 229.449).
[I] Average over 10 runs is 225.472 ms (host walltime is 225.532 ms, 99% percentile time is 239.276).
[I] Average over 10 runs is 221.678 ms (host walltime is 221.736 ms, 99% percentile time is 230.706).
[I] Average over 10 runs is 223.566 ms (host walltime is 223.61 ms, 99% percentile time is 233.467).
[I] Average over 10 runs is 223.958 ms (host walltime is 224.009 ms, 99% percentile time is 231.804).
[I] Average over 10 runs is 221.063 ms (host walltime is 221.103 ms, 99% percentile time is 224.068).
[I] Average over 10 runs is 224.265 ms (host walltime is 224.306 ms, 99% percentile time is 230.746).
[I] Average over 10 runs is 220.096 ms (host walltime is 220.136 ms, 99% percentile time is 223.987).
[I] Average over 10 runs is 225.106 ms (host walltime is 225.163 ms, 99% percentile time is 232.841).
[I] Average over 10 runs is 222.018 ms (host walltime is 222.079 ms, 99% percentile time is 227.116).
[I] Average over 10 runs is 220.158 ms (host walltime is 220.199 ms, 99% percentile time is 224.125).
&&&& PASSED TensorRT.trtexec # ./trtexec --loadEngine=/home/deepstream/Desktop/TA/resnet18_detector_fp16.engine --fp16 --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

Do you have any idea?

Hi m.billson16 ,
First, could you please run “$ sudo nvpmodel -q 0” and “$ sudo jetson_clocks” on your Nano?
Second, can you paste your running command line?
Last, please press Ctrl+C and run it again. It should not freeze for a long time.

Hello Morganh, thanks for the help. Fortunately, I solved the freeze problem.

For sudo nvpmodel -q 0, I got this result

NVPM WARN: fan mode is not set!
NV Power Mode: MAXN
0

But I got no output when I ran sudo jetson_clocks.

Also, which running command line should I paste here?

Hi m.billson,
Running “sudo jetson_clocks” will set the CPU/EMC/GPU clocks to max; it prints no output, so what you saw is expected.
Could you please paste the command lines you used for “tlt-export” (on x86_64) and “tlt-converter” (on the Nano)?

Hello Morganh, thanks for the help.

Here is my tlt-export command line

!tlt-export $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
            -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt \
            --outputs output_cov/Sigmoid,output_bbox/BiasAdd \
            -k $KEY \
            --input_dims 3,720,1280 \
            --max_workspace_size 1073741824 \
            --export_module detectnet_v2 \
            --data_type fp16 \
            --batches 10 \
            --cal_batch_size 4 \
            --verbose

And here is the tlt-converter command line on the Nano:

./tlt-converter $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_fp16.etlt \
   -k dWhrajZsbWtobW8wZ2UycmhnaDdqZmw3cGg6MWNhZGU2NTYtNjA5Yy00ZWQ0LTgxZTktYzE4ZmZkOWI4NWI1 \
   -o output_cov/Sigmoid,output_bbox/BiasAdd \
   -d 3,720,1280 \
   -t fp16 \
   -e /home/deepstream/Desktop/TA/pruned/resnet18_detector_fp16.engine \
   -w 50000000 \
   -b 10

Do you have any idea?

Also, note that the trtexec result above shows roughly 220 ms per inference, which on its own would limit you to about 4-5 fps. Please consider doing further experiments:

  1. Try to run “tlt-prune” to actually prune your trained tlt model (since your prune-ratio is 1, the tlt model is not pruned at all), then retrain to get a pruned tlt model. Then export and generate the TRT engine again to test.
  2. Try to resize your 1280x720 dataset offline to a smaller size, and then train -> prune -> retrain -> export -> generate the TRT engine again for testing.

BTW, for my previous item (6), you mentioned that you can get 30 fps when you run deepstream-app against a local file. Can you confirm that you ran this with the generated TRT engine? If yes, that means the fps drops from 30 fps to 4-5 fps when you switch to the USB camera. So, is there any issue or bottleneck with the USB camera, or something else?

Hello Morganh, thanks for the idea. Actually, I did another experiment: I changed the prune threshold to 0.5, but I still get a pruning ratio of 1.0.

!tlt-prune -pm $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
           -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/ \
           -eq union \
           -pth 0.5 \
           -k $KEY
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-12-03 18:35:25,732 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-12-03 18:35:27.083059: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-03 18:35:27.132341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-03 18:35:27.132846: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7ae6c00 executing computations on platform CUDA. Devices:
2019-12-03 18:35:27.132890: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 950M, Compute Capability 5.0
2019-12-03 18:35:27.154163: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593765000 Hz
2019-12-03 18:35:27.154801: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7bfed50 executing computations on platform Host. Devices:
2019-12-03 18:35:27.154838: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-12-03 18:35:27.155037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.75GiB
2019-12-03 18:35:27.155076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-03 18:35:27.155805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-03 18:35:27.155831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-12-03 18:35:27.155845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-12-03 18:35:27.155941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
2019-12-03 18:35:28,668 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2019-12-03 18:35:29,197 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2019-12-03 18:35:49,277 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 1.0

Do you have any idea?

Hello Morganh, I tried to resize my dataset from 1280 x 720 to 640 x 480, but I faced the same problem of low FPS (around 11-12 fps). Also, sometimes the object from my dataset isn't detected, while other items that were not part of my dataset are detected. Do you have any idea?

Did you write a script to resize the corresponding bboxes (xmin, ymin, xmax, ymax) in all the label text files?
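
If not, here is a minimal sketch of what such a script could look like, assuming KITTI-format labels where xmin, ymin, xmax, ymax are fields 4-7 of each line (directory names and the 1280x720 -> 640x480 sizes are just placeholders for illustration):

import os
from PIL import Image

# Hypothetical directory layout; adjust to your dataset.
SRC_IMG, SRC_LBL = "images_1280x720", "labels_1280x720"
DST_IMG, DST_LBL = "images_640x480", "labels_640x480"
NEW_W, NEW_H = 640, 480
SX, SY = NEW_W / 1280.0, NEW_H / 720.0  # scale factors for x and y

os.makedirs(DST_IMG, exist_ok=True)
os.makedirs(DST_LBL, exist_ok=True)

for name in os.listdir(SRC_IMG):
    base = os.path.splitext(name)[0]
    # Resize the image itself.
    Image.open(os.path.join(SRC_IMG, name)).resize((NEW_W, NEW_H)).save(os.path.join(DST_IMG, name))
    # Scale the bbox fields (4-7) in the matching KITTI label file.
    out_lines = []
    with open(os.path.join(SRC_LBL, base + ".txt")) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            fields[4] = "{:.2f}".format(float(fields[4]) * SX)  # xmin
            fields[5] = "{:.2f}".format(float(fields[5]) * SY)  # ymin
            fields[6] = "{:.2f}".format(float(fields[6]) * SX)  # xmax
            fields[7] = "{:.2f}".format(float(fields[7]) * SY)  # ymax
            out_lines.append(" ".join(fields))
    with open(os.path.join(DST_LBL, base + ".txt"), "w") as f:
        f.write("\n".join(out_lines) + "\n")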

That does not make sense. Could you please try more pth values? Thanks.
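
For example, a rough sketch of sweeping several -pth values (the output directories and the value list are just placeholders) and then checking the pruning ratio reported in each log:

for pth in 0.1 0.3 0.5 0.7; do
  out_dir=$USER_EXPERIMENT_DIR/experiment_dir_pruned_pth_${pth}
  mkdir -p ${out_dir}
  tlt-prune -pm $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
            -o ${out_dir}/ \
            -eq union \
            -pth ${pth} \
            -k $KEY 2>&1 | tee ${out_dir}/prune.log
done
# Each log ends with "Pruning ratio (pruned model / original model): ..."
grep "Pruning ratio" $USER_EXPERIMENT_DIR/experiment_dir_pruned_pth_*/prune.log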