FCN AlexNet Convolution Very Slow Without Shift Layer

Hi,

I have been testing some algorithms with the jetson-inference examples, more precisely segNet.
When I remove the shift layer from the deploy.prototxt and apply the shift instead in gpuPreImageNet (https://github.com/dusty-nv/jetson-inference/blob/master/imageNet.cu#L43), the conv1 execution time in TensorRT explodes.
The gpuPreImageNet kernel is faster than the shift layer, which is why I want to remove it.

But I don’t understand why there is a difference. Does TensorRT require a specific layer in the deploy file to format the data?

  • Here is the timing without the shift layer:
[GIE]  layer conv1 + relu1 - 100.882431 ms
[GIE]  layer pool1 - 0.922624 ms
[GIE]  layer norm1 - 0.389120 ms
[GIE]  layer conv2 + relu2 - 11.571200 ms
[GIE]  layer pool2 - 0.616448 ms
[GIE]  layer norm2 - 0.318464 ms
[GIE]  layer conv3 + relu3 - 6.608896 ms
[GIE]  layer conv4 + relu4 - 3.917824 ms
[GIE]  layer conv5 + relu5 - 2.490368 ms
[GIE]  layer pool5 - 0.160768 ms
[GIE]  layer fc6 + relu6 - 74.227715 ms
[GIE]  layer fc7 + relu7 - 32.695297 ms
[GIE]  layer score_fr - 0.507904 ms
[GIE]  layer network time - 235.309067 ms
  • And with the shift layer:
[GIE]  layer shift - 9.828352 ms
[GIE]  layer conv1 + relu1 - 13.209600 ms
[GIE]  layer pool1 - 0.885760 ms
[GIE]  layer norm1 - 1.014784 ms
[GIE]  layer conv2 + relu2 - 9.421824 ms
[GIE]  layer pool2 - 0.588800 ms
[GIE]  layer norm2 - 0.309248 ms
[GIE]  layer conv3 + relu3 - 5.115904 ms
[GIE]  layer conv4 + relu4 - 5.628928 ms
[GIE]  layer conv5 + relu5 - 2.464768 ms
[GIE]  layer pool5 - 0.193536 ms
[GIE]  layer fc6 + relu6 - 73.565186 ms
[GIE]  layer fc7 + relu7 - 34.207745 ms
[GIE]  layer score_fr - 0.902144 ms
[GIE]  layer network time - 157.336578 ms

There is a huge difference: conv1 alone goes from 100.88 ms down to 13.21 ms, and even counting the 9.83 ms spent in the shift layer, the network is ~78 ms slower overall without it!

My tests were done on Ubuntu 16.04 with a GTX 1070, with both TensorRT 2.1 + CUDA 8 and TensorRT 3.0 + CUDA 9.

Thanks for your help.

Hi Austriker, which model are you using when you run with the shift layer? Did you train it yourself, or is it from the repo?

Hi Dusty,

I have tested with my own model, fine-tuned like the example in the jetson-inference repo.
I have also trained my model without the shift layer, but with a pycaffe layer that normalises the input between -1 and 1 (I have also tested that one with TensorRT and get the same result).

In Caffe, the conv1 timing doesn’t change with or without the shift.

OK good, please use your own fine-tuned models in the comparison to avoid any other differences in training.

Normally Python layers wouldn’t be supported by TensorRT; are those making it into the runtime network?

To be more precise:

I have trained my model using pycaffe. I removed the shift layer during training and applied the normalisation in a Python layer:

img = img * 2 / 255 - 1  # to simplify: values between -1 and 1

So in the segNet kernel gpuPreImageNet I have:

const float4 px  = input[ dy * iWidth + dx ];
const float3 bgr = make_float3(px.z, px.y, px.x);

output[n * 0 + y * oWidth + x] = bgr.x * 2 / 255 - 1;
output[n * 1 + y * oWidth + x] = bgr.y * 2 / 255 - 1;
output[n * 2 + y * oWidth + x] = bgr.z * 2 / 255 - 1;
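
For reference, a fuller self-contained version of this kernel would look roughly like the sketch below. The kernel name, the scale parameter, the nearest-neighbour rescale, and the launch configuration are approximations patterned on the linked imageNet.cu, not the exact code:

// hypothetical fused pre-processing kernel: float4 RGBA input,
// planar BGR output normalised to roughly [-1, 1]
__global__ void gpuPreImageNetNorm(float2 scale, float4* input, int iWidth,
                                   float* output, int oWidth, int oHeight)
{
	const int x = blockIdx.x * blockDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + threadIdx.y;

	if (x >= oWidth || y >= oHeight)
		return;

	const int n  = oWidth * oHeight;           // size of one output plane
	const int dx = (int)((float)x * scale.x);  // nearest-neighbour rescale
	const int dy = (int)((float)y * scale.y);

	const float4 px  = input[dy * iWidth + dx];
	const float3 bgr = make_float3(px.z, px.y, px.x);  // swap RGB -> BGR

	output[n * 0 + y * oWidth + x] = bgr.x * 2.0f / 255.0f - 1.0f;
	output[n * 1 + y * oWidth + x] = bgr.y * 2.0f / 255.0f - 1.0f;
	output[n * 2 + y * oWidth + x] = bgr.z * 2.0f / 255.0f - 1.0f;
}

// launched with something like (iDivUp rounds the grid size up):
//   const dim3 blockDim(8, 8);
//   const dim3 gridDim(iDivUp(oWidth, blockDim.x), iDivUp(oHeight, blockDim.y));
//   gpuPreImageNetNorm<<<gridDim, blockDim>>>(scale, input, iWidth, output, oWidth, oHeight);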

And here is the deploy that makes conv1 explode:

input: "data"
input_shape {
  dim: 1
  dim: 3
  dim: 1028
  dim: 1232
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    pad: 0
    kernel_size: 11
    group: 1
    stride: 4
  }
...
layer {
  name: "score_fr"
  type: "Convolution"
  bottom: "fc7"
  top: "score_fr"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 3
    pad: 0
    kernel_size: 1
  }
}

And here is the deploy which keeps conv1 fast, but costs me ~10 ms in a useless layer (the Power layer with shift: 0 is effectively an identity operation):

input: "data"
input_shape {
  dim: 1
  dim: 3
  dim: 1028
  dim: 1232
}
layer {
  name: "shift"
  type: "Power"
  bottom: "data"
  top: "data_preprocessed"
  power_param {
    shift: 0
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data_preprocessed"
  top: "conv1"
  convolution_param {
    num_output: 96
    pad: 0
    kernel_size: 11
    group: 1
    stride: 4
  }
}
...
layer {
  name: "score_fr"
  type: "Convolution"
  bottom: "fc7"
  top: "score_fr"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 3
    pad: 0
    kernel_size: 1
  }
}

The output of the network is correct for both tests.

Hi,

I have tested the same model on a Jetson TX2 with TensorRT 2.1. I get the same result: the conv1 execution time explodes without the shift layer, even though the input data is normalized in gpuPreImageNet.

There is no Python in the runtime network, only C++.

Hi,

Is there also a reason why, when I use a plugin to replace for example the fc6 and fc7 layers, the execution time of the rest of the network is multiplied by 2?

Hi Austriker, perhaps it is related to FP32/FP16? Also, is the plugin processing on the CPU or the GPU?

Hi Dusty,

If a plugin is active, do all the other layers run in FP32? The plugin layer is running on the GPU.
Regarding the first issue with the shift layer, there is no active plugin when I run the tests; it’s the stock jetson-inference repo.

The plugin will be FP32, but TensorRT should normally add unzip/zip layers to convert to FP16 and back again.
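
For context, FP16 mode is something the builder opts into when the engine is created, roughly like the sketch below (simplified; gLogger and the file-name variables stand in for the surrounding code, this is not an exact copy of tensorNet.cpp):

// hedged sketch: enabling FP16 ("half2") mode in the TensorRT 2.x/3.x builder
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
nvinfer1::INetworkDefinition* network = builder->createNetwork();
nvcaffeparser1::ICaffeParser* parser = nvcaffeparser1::createCaffeParser();

// only request FP16 where the device supports it at full speed
// (TX1/TX2 do; a GTX 1070 does not, so it stays FP32 there)
const bool useFp16 = builder->platformHasFastFp16();

// import the caffe weights as FP16 when enabled
parser->parse(deployFile, modelFile, *network,
              useFp16 ? nvinfer1::DataType::kHALF : nvinfer1::DataType::kFLOAT);

// let the builder pick FP16 kernels; plugin layers still run FP32,
// which is why the zip/unzip (reformat) layers appear around them
builder->setHalf2Mode(useFp16);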

If you enable layer profiling mode, it will print out the layer names so you can confirm; see here:

https://github.com/dusty-nv/jetson-inference/blob/8ed492bfdc9e1b98f7711cd5ae224ce588635656/tensorNet.h
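
Under the hood those per-layer timings come from an IProfiler attached to the execution context. A rough sketch of what tensorNet does (simplified, not the exact code):

#include <cstdio>
#include "NvInfer.h"

// TensorRT calls reportLayerTime() once per layer after each profiled run
class LayerProfiler : public nvinfer1::IProfiler
{
public:
	void reportLayerTime(const char* layerName, float ms) override
	{
		printf("[GIE]  layer %s - %f ms\n", layerName, ms);
	}
} gProfiler;

// attach it before running inference; per-layer profiling is reported
// on the synchronous execute() path rather than enqueue():
//   context->setProfiler(&gProfiler);
//   context->execute(batchSize, buffers);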

No, I don’t see a zip/unzip layer.

The fc6 and fc7 layers are running with the plugin.

[GIE]  layer shift - 4.834336 ms
[GIE]  layer conv1 + relu1 input reformatter 0
[GIE]  layer conv1 + relu1
[GIE]  layer pool1
[GIE]  layer norm1
[GIE]  layer conv2 + relu2
[GIE]  layer pool2
[GIE]  layer norm2
[GIE]  layer conv3 + relu3
[GIE]  layer conv4 + relu4
[GIE]  layer conv5 + relu5
[GIE]  layer pool5
[GIE]  layer fc6 input reformatter 0
[GIE]  layer fc6
[GIE]  layer relu6
[GIE]  layer fc7
[GIE]  layer relu7
[GIE]  layer score_fr

Do you have any idea why conv1 without the shift layer explodes in computation time?

In this case, it looks like the operation is called “reformatter” now. This probably corresponds to the FP16 conversion, because it happens near the beginning and at your plugin. My guess as to why it doesn’t convert back to FP16 for the remaining few layers is that it determines it’s not worth it.

I’m not exactly sure; I’ve sent this particular question to some colleagues for review.
For now I would keep using it with the zero-shift layer (as redundant as that is).

But it takes twice the time for conv1 through conv5 without FP16. Is it possible to force FP16?

Thank you for your feedback.

Hi,

We have checked fcn_alexnet with and without the shift layer and didn’t find anything strange.

With shift:

shift input reformatter 0                4.853ms
shift                                    2.347ms
conv1.pruned + relu1                     13.126ms
pool1                                    0.705ms
norm1                                    0.281ms
conv2.pruned + relu2                     5.279ms
pool2                                    0.273ms
norm2                                    0.096ms
conv3.pruned + relu3                     0.646ms
conv4.pruned + relu4                     1.129ms
conv5.pruned + relu5                     0.459ms
pool5                                    0.036ms
fc6.pruned + relu6                       1.613ms
fc7.pruned + relu7                       4.912ms
score_fr_21classes.pruned                0.604ms
score_fr_21classes.pruned output reforma 0.020ms
Time over all layers: 36.381

Without shift:

conv1.pruned + relu1 input reformatter 0 4.848ms
conv1.pruned + relu1                     13.132ms
pool1                                    0.705ms
norm1                                    0.283ms
conv2.pruned + relu2                     5.281ms
pool2                                    0.274ms
norm2                                    0.096ms
conv3.pruned + relu3                     0.649ms
conv4.pruned + relu4                     1.131ms
conv5.pruned + relu5                     0.459ms
pool5                                    0.036ms
fc6.pruned + relu6                       1.614ms
fc7.pruned + relu7                       4.925ms
score_fr_21classes.pruned                0.608ms
score_fr_21classes.pruned output reforma 0.020ms
Time over all layers: 34.062

The profiling results look normal, and the model without the shift layer actually has slightly better performance.

We used the default TensorRT sample to profile the model performance.
Could you also check whether your issue can be reproduced with the native TensorRT sample?

Thanks

Hi all,

I have been testing with the same sample as you did.
But after launching it 3 times and getting stable results, the following sessions were not stable at all.
Here are the results.
Do you know why?

1st test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 25.0843 ms.
Average over 10 runs is 25.1267 ms.
Average over 10 runs is 25.1284 ms.
Average over 10 runs is 25.1203 ms.
Average over 10 runs is 25.1188 ms.
Average over 10 runs is 25.1187 ms.
Average over 10 runs is 25.1225 ms.
Average over 10 runs is 25.1531 ms.
Average over 10 runs is 25.1162 ms.
Average over 10 runs is 25.113 ms.
Average over 10 runs is 25.1134 ms.
Average over 10 runs is 25.1206 ms.
Average over 10 runs is 25.1131 ms.
Average over 10 runs is 25.1053 ms.
Average over 10 runs is 25.1135 ms.
Average over 10 runs is 25.1162 ms.
Average over 10 runs is 25.1138 ms.
Average over 10 runs is 25.1356 ms.
Average over 10 runs is 25.2261 ms.
Average over 10 runs is 25.124 ms.

2nd test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 25.0894 ms.
Average over 10 runs is 25.1149 ms.
Average over 10 runs is 25.0964 ms.
Average over 10 runs is 25.1101 ms.
Average over 10 runs is 25.0913 ms.
Average over 10 runs is 25.1021 ms.
Average over 10 runs is 25.0962 ms.
Average over 10 runs is 25.1053 ms.
Average over 10 runs is 25.0943 ms.
Average over 10 runs is 25.0939 ms.
Average over 10 runs is 25.0825 ms.
Average over 10 runs is 25.0944 ms.
Average over 10 runs is 25.0849 ms.
Average over 10 runs is 25.1064 ms.
Average over 10 runs is 25.0914 ms.
Average over 10 runs is 25.0953 ms.
Average over 10 runs is 25.0917 ms.
Average over 10 runs is 25.1148 ms.
Average over 10 runs is 25.092 ms.
Average over 10 runs is 25.0918 ms.

3rd test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 25.1215 ms.
Average over 10 runs is 25.1118 ms.
Average over 10 runs is 25.1063 ms.
Average over 10 runs is 25.1124 ms.
Average over 10 runs is 25.1224 ms.
Average over 10 runs is 25.1232 ms.
Average over 10 runs is 25.1098 ms.
Average over 10 runs is 25.098 ms.
Average over 10 runs is 25.1085 ms.
Average over 10 runs is 25.1032 ms.
Average over 10 runs is 25.1056 ms.
Average over 10 runs is 25.1097 ms.
Average over 10 runs is 25.1086 ms.
Average over 10 runs is 25.1098 ms.
Average over 10 runs is 25.1187 ms.
Average over 10 runs is 25.108 ms.
Average over 10 runs is 25.092 ms.
Average over 10 runs is 25.1126 ms.
Average over 10 runs is 25.1069 ms.
Average over 10 runs is 25.0928 ms.

And then from the 4th test, it gets weird:

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 36.2181 ms.
Average over 10 runs is 36.2666 ms.
Average over 10 runs is 36.4036 ms.
Average over 10 runs is 36.5466 ms.
Average over 10 runs is 28.5801 ms.
Average over 10 runs is 17.7351 ms.
Average over 10 runs is 17.7327 ms.
Average over 10 runs is 17.7302 ms.
Average over 10 runs is 17.7099 ms.
Average over 10 runs is 17.7183 ms.
Average over 10 runs is 17.7139 ms.
Average over 10 runs is 17.7292 ms.
Average over 10 runs is 17.7235 ms.
Average over 10 runs is 17.7264 ms.
Average over 10 runs is 17.7215 ms.
Average over 10 runs is 17.7217 ms.
Average over 10 runs is 17.7211 ms.
Average over 10 runs is 17.7591 ms.
Average over 10 runs is 17.7206 ms.
Average over 10 runs is 17.7162 ms.

5th test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 29.4871 ms.
Average over 10 runs is 29.5042 ms.
Average over 10 runs is 29.6686 ms.
Average over 10 runs is 29.6803 ms.
Average over 10 runs is 29.6761 ms.
Average over 10 runs is 29.6828 ms.
Average over 10 runs is 29.691 ms.
Average over 10 runs is 29.6844 ms.
Average over 10 runs is 29.6635 ms.
Average over 10 runs is 29.6998 ms.
Average over 10 runs is 29.6705 ms.
Average over 10 runs is 29.6869 ms.
Average over 10 runs is 29.673 ms.
Average over 10 runs is 29.6667 ms.
Average over 10 runs is 29.665 ms.
Average over 10 runs is 29.6779 ms.
Average over 10 runs is 29.6836 ms.
Average over 10 runs is 29.7064 ms.
Average over 10 runs is 29.6848 ms.
Average over 10 runs is 29.6697 ms.

And the last one:

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 36.2083 ms.
Average over 10 runs is 36.3211 ms.
Average over 10 runs is 36.3594 ms.
Average over 10 runs is 36.5478 ms.
Average over 10 runs is 36.6804 ms.
Average over 10 runs is 36.7456 ms.
Average over 10 runs is 36.76 ms.
Average over 10 runs is 36.6153 ms.
Average over 10 runs is 36.6486 ms.
Average over 10 runs is 36.6631 ms.
Average over 10 runs is 36.6494 ms.
Average over 10 runs is 36.6599 ms.
Average over 10 runs is 36.6613 ms.
Average over 10 runs is 36.6458 ms.
Average over 10 runs is 36.658 ms.
Average over 10 runs is 36.6578 ms.
Average over 10 runs is 36.6765 ms.
Average over 10 runs is 36.6538 ms.
Average over 10 runs is 36.6606 ms.
Average over 10 runs is 36.6612 ms.

Here is the same test as you did, but with no shift.
Even stranger, it starts to differ from the 2nd test.

1st test :

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet_noshift.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet_noshift.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 28.8686 ms.
Average over 10 runs is 29.0036 ms.
Average over 10 runs is 29.0754 ms.
Average over 10 runs is 29.0555 ms.
Average over 10 runs is 29.0621 ms.
Average over 10 runs is 29.0685 ms.
Average over 10 runs is 29.0662 ms.
Average over 10 runs is 29.0507 ms.
Average over 10 runs is 29.0868 ms.
Average over 10 runs is 29.0465 ms.
Average over 10 runs is 29.05 ms.
Average over 10 runs is 29.0682 ms.
Average over 10 runs is 29.0535 ms.
Average over 10 runs is 29.0444 ms.
Average over 10 runs is 29.0517 ms.
Average over 10 runs is 29.0821 ms.
Average over 10 runs is 29.058 ms.
Average over 10 runs is 29.0507 ms.
Average over 10 runs is 29.0638 ms.
Average over 10 runs is 29.0667 ms.

2nd test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet_noshift.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet_noshift.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 67.7145 ms.
Average over 10 runs is 67.9457 ms.
Average over 10 runs is 68.2224 ms.
Average over 10 runs is 68.6598 ms.
Average over 10 runs is 68.9016 ms.
Average over 10 runs is 68.9204 ms.
Average over 10 runs is 68.9528 ms.
Average over 10 runs is 68.8943 ms.
Average over 10 runs is 68.8744 ms.
Average over 10 runs is 68.861 ms.
Average over 10 runs is 68.9043 ms.
Average over 10 runs is 68.8993 ms.
Average over 10 runs is 68.9025 ms.
Average over 10 runs is 68.9083 ms.
Average over 10 runs is 68.9315 ms.
Average over 10 runs is 68.8906 ms.
Average over 10 runs is 68.8469 ms.
Average over 10 runs is 68.9103 ms.
Average over 10 runs is 68.8616 ms.
Average over 10 runs is 68.9034 ms.

3rd test

nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./giexec --deploy=fcn_alexnet_noshift.deploy.prototxt --output=score_fr_21classes --iterations=20
deploy: fcn_alexnet_noshift.deploy.prototxt
output: score_fr_21classes
iterations: 20
Input "data": 3x720x1280
Output "score_fr_21classes": 21x17x34
name=data, bindingIndex=0, buffers.size()=2
name=score_fr_21classes, bindingIndex=1, buffers.size()=2
Average over 10 runs is 35.3226 ms.
Average over 10 runs is 35.5611 ms.
Average over 10 runs is 35.6415 ms.
Average over 10 runs is 35.7841 ms.
Average over 10 runs is 35.7704 ms.
Average over 10 runs is 35.7469 ms.
Average over 10 runs is 35.7926 ms.
Average over 10 runs is 35.7469 ms.
Average over 10 runs is 35.7632 ms.
Average over 10 runs is 35.7942 ms.
Average over 10 runs is 35.7725 ms.
Average over 10 runs is 35.7559 ms.
Average over 10 runs is 35.759 ms.
Average over 10 runs is 35.7711 ms.
Average over 10 runs is 35.7539 ms.
Average over 10 runs is 35.7497 ms.
Average over 10 runs is 35.7458 ms.
Average over 10 runs is 35.8039 ms.
Average over 10 runs is 35.7723 ms.
Average over 10 runs is 35.7623 ms.

Looking more closely at the layer times, it seems the lost time is spread across all layers
(this was done with the no-shift deploy):

1st test

layer conv1.pruned + relu1 - 11.2768 ms
layer pool1 - 0.80272 ms
layer norm1 - 0.20752 ms
layer conv2.pruned + relu2 - 8.36192 ms
layer pool2 - 0.350208 ms
layer norm2 - 0.090912 ms
layer conv3.pruned + relu3 - 0.82672 ms
layer conv4.pruned + relu4 - 1.51632 ms
layer conv5.pruned + relu5 - 0.56272 ms
layer pool5 - 0.03968 ms
layer fc6.pruned + relu6 - 2.65888 ms
layer fc7.pruned + relu7 - 7.69216 ms
layer score_fr_21classes.pruned - 0.8352 ms
layer network time - 35.2218 ms
Average over 10 runs is 35.2203 ms.

2nd Test

layer conv1.pruned + relu1 - 14.6209 ms
layer pool1 - 1.05878 ms
layer norm1 - 0.27104 ms
layer conv2.pruned + relu2 - 11.0164 ms
layer pool2 - 0.461184 ms
layer norm2 - 0.11504 ms
layer conv3.pruned + relu3 - 1.08976 ms
layer conv4.pruned + relu4 - 1.99466 ms
layer conv5.pruned + relu5 - 0.743744 ms
layer pool5 - 0.05072 ms
layer fc6.pruned + relu6 - 3.49584 ms
layer fc7.pruned + relu7 - 10.1112 ms
layer score_fr_21classes.pruned - 1.05504 ms
layer network time - 46.0843 ms
Average over 10 runs is 46.1966 ms.

Hi,

Could you also monitor the tegrastats output (it reports the GPU/EMC clock frequencies, so we can check whether the clocks change between runs) and share it with us?

sudo ./tegrastats

Thanks.

Hi,

Have you tried tegrastats?

Hi,

We have changed several things in our code.
I also switched from CUDA 8 / TensorRT 2 to CUDA 9 / TensorRT 3.
This seems to give a different behaviour: with CUDA 9 / TensorRT 3 there no longer seems to be a lag at the beginning, so maybe it’s not a problem anymore with TensorRT 3.

This week I’m on a business trip, so it’s hard to test again, but I will try.
I will answer for sure the week after, if that’s OK for you.