No speed up with TensorRT FP16 or INT8 on NVIDIA V100

I have been trying to use trt.create_inference_graph to convert my Keras-derived TensorFlow SavedModel from FP32 to FP16 and INT8, and then save it in a format that can be used for TensorFlow Serving. Code here - Google Colab
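For reference, the conversion step looks roughly like this (a minimal sketch assuming TF 1.13 with tf.contrib.tensorrt; the file name and output node names are placeholders, not the notebook's actual values):

```python
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen graph exported from the Keras/TensorFlow model
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Replace supported subgraphs with TensorRT engine ops
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['boxes', 'scores', 'labels'],  # placeholder output node names
    max_batch_size=8,
    max_workspace_size_bytes=1 << 30,       # ~1 GB scratch space for TensorRT
    precision_mode='FP16')                  # or 'INT8'
```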

However, running this with my test client, I see no change in the timing.

Here are the timings. What am I missing?
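For context, the numbers below come from a simple timing loop in my test client; a minimal sketch (predict_fn stands in for my actual request call to TF Serving):

```python
import time

def time_requests(predict_fn, image, n):
    # Time n back-to-back inference requests
    start = time.time()
    for _ in range(n):
        predict_fn(image)  # one request to TF Serving (hypothetical call)
    print('Time for ', n, ' is ', time.time() - start)
```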

FP32 - V100 - No optimization

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.968112)
('Label', 'person', ' at ', array([ 0, 426, 512, 785]), ' Score ', 0.8355837)
('Label', 'person', ' at ', array([ 723, 475, 1067, 791]), ' Score ', 0.7234411)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 20, ' is ', 1.4128220081329346)
('Time for ', 10, ' is ', 0.7228488922119141)

FP32 with TensorFlow-based optimization - TransformGraph

Without weight or model quantization:
('Time for ', 10, ' is ', 0.6342859268188477)

FP?? with TensorFlow-based optimization + weight quantization - TransformGraph

After weight quantization, the model size is 39 MB!! (down from ~149 MB),
but the time doubles:
('Time for ', 10, ' is ', 1.201113224029541)
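Both TransformGraph variants above follow roughly this pattern (a sketch assuming TF 1.x; input/output node names are placeholders):

```python
from tensorflow.tools.graph_transforms import TransformGraph

transforms = [
    'strip_unused_nodes',
    'fold_constants(ignore_errors=true)',
    'fold_batch_norms',
    'quantize_weights',  # drop this entry for the plain FP32 variant above
]
optimized_graph = TransformGraph(
    frozen_graph,         # input GraphDef
    ['input_1'],          # input node names (placeholder)
    ['boxes', 'scores'],  # output node names (placeholder)
    transforms)
```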

Model Quantization - Does not work (at least with TF Serving)

Using NVIDIA TensorRT Optimization (colab notebook)
FP16 - V100

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([ 0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723, 475, 1067, 791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8691568374633789)
('Time for ', 20, ' is ', 1.6196839809417725)

INT8

('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([ 0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723, 475, 1067, 791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8551359176635742)
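One caveat with INT8 in the contrib-era TF-TRT flow: precision_mode='INT8' first yields a calibration graph, and a calibration pass is needed before the engines actually run in INT8. A sketch (calibration_images and run_inference are hypothetical placeholders):

```python
# Step 1: build a calibration graph
calib_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['boxes', 'scores', 'labels'],  # placeholder output node names
    max_batch_size=8,
    max_workspace_size_bytes=1 << 30,
    precision_mode='INT8')

# Step 2: feed representative images through the calibration graph
for image in calibration_images:       # your own sample inputs
    run_inference(calib_graph, image)  # hypothetical inference helper

# Step 3: convert the calibrated graph into the final INT8 inference graph
int8_graph = trt.calib_graph_to_infer_graph(calib_graph)
```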

Here is the output during optimization. It says many operations are not converted, e.g. Conv2D, even though the support matrix (https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops) lists them as supported.

2019-03-18 00:43:54.727555: I tensorflow/contrib/tensorrt/segment/segment.cc:443] There are 1978 ops of 45 different types in the graph that are not converted to TensorRT: Maximum, TopKV2, ConcatV2, NonMaxSuppressionV3, GatherV2, Exit, PadV2, Greater, NextIteration, TensorArrayWriteV3, Const, Identity, TensorArrayGatherV3, Switch, TensorArraySizeV3, Less, TensorArrayScatterV3, DataFormatVecPermute, Enter, Pack, LoopCond, StridedSlice, NoOp, Cast, Tile, Shape, ResizeNearestNeighbor, Size, Merge, TensorArrayV3, Conv2D, Range, Sub, Minimum, Placeholder, Add, Mul, TensorArrayReadV3, Reshape, GatherNd, Fill, LogicalAnd, Transpose, Where, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-03-18 00:43:55.848366: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 7
2019-03-18 00:43:56.724506: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 40 nodes succeeded.
2019-03-18 00:43:56.733469: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 41 nodes succeeded.
2019-03-18 00:43:56.804503: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 511 nodes succeeded.
2019-03-18 00:43:56.815349: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 67 nodes succeeded.
2019-03-18 00:43:56.815778: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node boxes/TRTEngineOp_4 added for segment 4 consisting of 11 nodes succeeded.
2019-03-18 00:43:56.815940: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node boxes/TRTEngineOp_5 added for segment 5 consisting of 11 nodes succeeded.
2019-03-18 00:43:56.816075: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_6 added for segment 6 consisting of 7 nodes succeeded.
2019-03-18 00:43:56.956398: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
2019-03-18 00:43:56.956456: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 4560 nodes (-293), 7069 edges (-221), time = 750.528ms.
2019-03-18 00:43:56.956512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   layout: Graph size after: 4617 nodes (57), 7123 edges (54), time = 246.445ms.
2019-03-18 00:43:56.956537: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   constant folding: Graph size after: 4601 nodes (-16), 7123 edges (0), time = 608.315ms.
2019-03-18 00:43:56.956560: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583]   TensorRTOptimizer: Graph size after: 3920 nodes (-681), 6398 edges (-725), time = 2563.16504ms.
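For what it is worth, which ops end up inside TensorRT segments also depends on the conversion parameters, not just on per-op support; a sketch of the relevant knobs (the values shown are illustrative, not necessarily what the notebook used):

```python
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['boxes', 'scores', 'labels'],  # placeholder output node names
    max_batch_size=8,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP16',
    minimum_segment_size=3,  # smaller candidate segments stay in TensorFlow
    is_dynamic_op=True)      # build engines at serving time for actual shapes
```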

I am getting some speedup now, around 20 percent.
Here is the TF Serving output:

2019-03-19 01:39:30.136933: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: retinanet version: 10}
2019-03-19 01:39:30.177492: I tensorflow_serving/model_servers/server.cc:313] Running gRPC ModelServer at 0.0.0.0:8500 ...
2019-03-19 01:39:30.199864: I tensorflow_serving/model_servers/server.cc:333] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 237] RAW: Entering the event loop ...
2019-03-19 01:43:27.233409: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/TRTEngineOp_2 with batch size 8
2019-03-19 01:43:57.222618: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/TRTEngineOp_1 with batch size 8
2019-03-19 01:43:57.272867: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/TRTEngineOp_3 with batch size 8
2019-03-19 01:44:01.431554: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/TRTEngineOp_0 with batch size 8
2019-03-19 01:44:07.837837: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/TRTEngineOp_6 with batch size 8
2019-03-19 01:44:07.842032: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/boxes/TRTEngineOp_5 with batch size 8
2019-03-19 01:44:07.842060: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:496] Building a new TensorRT engine for import/boxes/TRTEngineOp_4 with batch size 8
2019-03-19 01:44:07.844806: W external/org_tensorflow/tensorflow/contrib/tensorrt/log/trt_logger.cc:34] DefaultLogger Tensor DataType is determined at build time for tensors not marked as input or output.
2019-03-19 01:44:07.845004: W external/org_tensorflow/tensorflow/contrib/tensorrt/log/trt_logger.cc:34] DefaultLogger Tensor DataType is determined at build time for tensors not marked as input or output.
2019-03-19 01:44:07.845012: W external/org_tensorflow/tensorflow/contrib/tensorrt/log/trt_logger.cc:34] DefaultLogger Tensor DataType is determined at build time for tensors not marked as input or output.
2019-03-19 01:44:07.845282: W external/org_tensorflow/tensorflow/contrib/tensorrt/log/trt_logger.cc:34] DefaultLogger Tensor DataType is determined at build time for tensors not marked as input or output.
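Note the ~40-second gap in the timestamps above: the TensorRT engines are only built when the first request arrives, so any timing run should be preceded by warm-up requests. A sketch against the REST endpoint (the payload shape is an assumption about the model's input):

```python
import requests

# image: a preprocessed numpy array matching the model's input shape
payload = {'instances': [image.tolist()]}
for _ in range(3):  # warm-up requests before timing
    requests.post('http://localhost:8501/v1/models/retinanet:predict',
                  json=payload)
```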

Hello,

To help us debug, can you provide more details about the changes you made to obtain the 20 percent optimization and a small repro package so we can look further into this question?

Thanks.

What I did differently was to run the optimization code on the same V100 GPU. Earlier I was running it on my own machine (on CPU) and using the converted model on the V100 machine.
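A quick sanity check that the conversion process actually sees the GPU (just a sketch):

```python
from tensorflow.python.client import device_lib

# Prints the visible GPU devices; an empty list means the conversion
# would run on CPU, as in my earlier attempts
print([d.name for d in device_lib.list_local_devices()
       if d.device_type == 'GPU'])
```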

I will add details to the Colab page shortly and post them here. In the meantime, I have just one question.

In the warning message, I see that many operations, including ones that are deemed supported by TensorRT like Conv2D, are not getting converted. Is there an explanation for this? Thanks.

There are 1978 ops of 45 different types in the graph that are not converted to TensorRT: Maximum, TopKV2, ConcatV2, NonMaxSuppressionV3, GatherV2, Exit, PadV2, Greater, NextIteration, TensorArrayWriteV3, Const, Identity, TensorArrayGatherV3, Switch, TensorArraySizeV3, Less, TensorArrayScatterV3, DataFormatVecPermute, Enter, Pack, LoopCond, StridedSlice, NoOp, Cast, Tile, Shape, ResizeNearestNeighbor, Size, Merge, TensorArrayV3, Conv2D, Range, Sub, Minimum, Placeholder, Add, Mul, TensorArrayReadV3, Reshape, GatherNd, Fill, LogicalAnd, Transpose, Where, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).

Sorry for the time it took to reply; I had to do this in my free time. I have given the full details in this blog post.

In it, I tried to convert an SSD model via TensorRT from FP32 to FP16. Here is the error I got; basically most of the operations, including the Conv2D layers of the convolutional neural network, are not converted. That would mean TensorRT does not have much value as of now, or I am missing something.

tensorflow/contrib/tensorrt/segment/segment.cc:443] There are 3962 ops of 51 different types in the graph that are not converted to TensorRT: TopKV2, NonMaxSuppressionV2, TensorArrayWriteV3, Const, Squeeze, ResizeBilinear, Maximum, Where, Add, Placeholder, Switch, TensorArrayGatherV3, NextIteration, Greater, TensorArraySizeV3, NoOp, TensorArrayV3, LoopCond, Less, StridedSlice, TensorArrayScatterV3, ExpandDims, Exit, Cast, Identity, Shape, RealDiv, TensorArrayReadV3, Reshape, Merge, Enter, Range, Conv2D, Mul, Equal, Sub, Minimum, Tile, Pack, Split, ZerosLike, ConcatV2, Size, Unpack, Assert, DataFormatVecPermute, Transpose, Gather, Exp, Slice, Fill, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).

I see that many operations, including ones that are deemed supported by TensorRT like Conv2D, are not getting converted. Is there an explanation for this?

I would like to know the answer to this as well. I get a similar message when using tf.trt.

Hi alexcpn,

I have run into the same problem. I got the same tf_serving output message as you, and no speedup with TensorRT FP16 or INT8.

Have you solved it now?