Grouped convolutions are required in MobileNet. I understand that cuDNN includes support for fast fp16 grouped convolutions. Unfortunately, my benchmarks seem to contradict this.
I am running a convolution with N = 1, C = 64, H = 56, W = 56, K = 64, groups = 64 (i.e. a depthwise convolution).
On the Jetson Nano, with the cuDNN that comes with JetPack, the regular non-grouped convolution takes 1.5 ms with HALF, which is slower than the 1.0 ms I get with FLOAT (which raises its own question: when is half faster than float?).
For the grouped convolution, HALF gives me a runtime of 10.4 ms, while FLOAT gives a runtime of 12 ms. That doesn't seem plausible given the official inference benchmarks with MobileNet-SSD. What's going on here? I am using the following to set up my grouped convolutions:
int group_count = 64;
checkCUDNN(cudnnSetConvolutionGroupCount(convolution_descriptor, group_count));
I set in_channel in the filter descriptor to 1 (that is, C / groups = 64 / 64).
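To make the filter shape explicit, here is a minimal sketch of how the four filter dimensions work out for a grouped convolution. The helper name grouped_filter_dims is hypothetical (it is not part of cuDNN), and the 3x3 kernel is an assumption, since the kernel size is not stated above. The resulting values are what would be passed as k, c, h, w to cudnnSetFilter4dDescriptor.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper, not a cuDNN function.
// Returns {k, c, h, w} for the filter descriptor of a grouped convolution:
// k is the total number of output channels, and the filter's channel
// dimension is the number of input channels per group (C / groups).
std::array<int, 4> grouped_filter_dims(int K, int C, int groups, int R, int S) {
    // Both channel counts must divide evenly into the groups.
    assert(groups > 0 && C % groups == 0 && K % groups == 0);
    return {K, C / groups, R, S};
}
```

With K = 64, C = 64, groups = 64 and the assumed 3x3 kernel, grouped_filter_dims(64, 64, 64, 3, 3) gives {64, 1, 3, 3}, which matches the in_channel = 1 setting described above.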
Is cuDNN slower than TensorRT?
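For scale, here is a rough sketch of the multiply-accumulate (MAC) count for the two cases. The helper conv_macs is hypothetical, and it assumes a 3x3 kernel with stride 1 and "same" padding (so the output stays 56 x 56); none of those are stated above. It only counts arithmetic and says nothing about memory traffic or kernel launch overhead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: MAC count for a stride-1, "same"-padded convolution,
// so the output spatial size equals the input size H x W.
// Each of the K output channels reads only C / groups input channels.
std::int64_t conv_macs(int N, int C, int H, int W, int K,
                       int R, int S, int groups) {
    assert(groups > 0 && C % groups == 0 && K % groups == 0);
    return static_cast<std::int64_t>(N) * K * (C / groups) * R * S * H * W;
}
```

Under those assumptions, conv_macs(1, 64, 56, 56, 64, 3, 3, 1) is 115,605,504 MACs for the regular convolution and conv_macs(1, 64, 56, 56, 64, 3, 3, 64) is 1,806,336 for the depthwise one, i.e. 64 times less arithmetic, yet the grouped version measures roughly 7 to 12 times slower in the timings above.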