Convolutions are slow on this hardware. I am benchmarking a 256x256x3 input image (format NHWC), convolved with a 3x3 filter with 64 output channels (format KCHW), producing a 256x256x64 output tensor (format NCHW), via cudnnConvolutionForward(). I then add biases to the convolution output with cudnnAddTensor() and apply ReLU to the final output with cudnnActivationForward().
This convolution is taking 10.73 ms.
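For reference, here is roughly what the call sequence looks like. This is a minimal sketch rather than my exact code: the descriptor setup, buffer names, and error-check macro are assumptions, and it uses cuDNN v5+-style signatures (the K1 is limited to older cuDNN releases, where some of these signatures differ):

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDNN(call)                                              \
    do {                                                               \
        cudnnStatus_t s = (call);                                      \
        if (s != CUDNN_STATUS_SUCCESS) {                               \
            fprintf(stderr, "cuDNN error: %s (line %d)\n",             \
                    cudnnGetErrorString(s), __LINE__);                 \
            exit(1);                                                   \
        }                                                              \
    } while (0)

int main() {
    cudnnHandle_t handle;
    CHECK_CUDNN(cudnnCreate(&handle));

    // Input: 1x256x256x3, NHWC (dims are always given as N,C,H,W).
    cudnnTensorDescriptor_t xDesc;
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&xDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC,
                                           CUDNN_DATA_FLOAT, 1, 3, 256, 256));

    // Filter: 64 output channels, 3 input channels, 3x3 kernel (KCHW).
    cudnnFilterDescriptor_t wDesc;
    CHECK_CUDNN(cudnnCreateFilterDescriptor(&wDesc));
    CHECK_CUDNN(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT,
                                           CUDNN_TENSOR_NCHW, 64, 3, 3, 3));

    // 3x3 "same" convolution: pad 1, stride 1, dilation 1.
    cudnnConvolutionDescriptor_t convDesc;
    CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&convDesc));
    CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                                CUDNN_CROSS_CORRELATION,
                                                CUDNN_DATA_FLOAT));

    // Output: 1x64x256x256, NCHW. Note: mixed NHWC input / NCHW output
    // support varies by algorithm and cuDNN version.
    cudnnTensorDescriptor_t yDesc;
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&yDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, 1, 64, 256, 256));

    // One bias value per output channel, broadcast over N/H/W.
    cudnnTensorDescriptor_t bDesc;
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&bDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(bDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, 1, 64, 1, 1));

    // Device buffers (contents left uninitialized in this sketch).
    float *x, *w, *b, *y;
    cudaMalloc(&x, 1 * 3 * 256 * 256 * sizeof(float));
    cudaMalloc(&w, 64 * 3 * 3 * 3 * sizeof(float));
    cudaMalloc(&b, 64 * sizeof(float));
    cudaMalloc(&y, 1 * 64 * 256 * 256 * sizeof(float));

    const float one = 1.0f, zero = 0.0f;

    // Convolution, hard-coded zero-workspace algorithm.
    CHECK_CUDNN(cudnnConvolutionForward(handle, &one, xDesc, x, wDesc, w,
                                        convDesc,
                                        CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                                        nullptr, 0, &zero, yDesc, y));

    // Bias add (beta = 1 accumulates into y).
    CHECK_CUDNN(cudnnAddTensor(handle, &one, bDesc, b, &one, yDesc, y));

    // In-place ReLU on the final output.
    cudnnActivationDescriptor_t actDesc;
    CHECK_CUDNN(cudnnCreateActivationDescriptor(&actDesc));
    CHECK_CUDNN(cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_RELU,
                                             CUDNN_NOT_PROPAGATE_NAN, 0.0));
    CHECK_CUDNN(cudnnActivationForward(handle, actDesc, &one, yDesc, y,
                                       &zero, yDesc, y));
    return 0;
}
```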
Is this the expected performance for such a small convolution, or are there optimisation tricks I am missing?
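For the record, one candidate trick is letting cuDNN benchmark and pick the fastest algorithm instead of hard-coding one, and giving it workspace memory. A sketch of that, reusing the descriptors and buffers from above and assuming a cuDNN release that provides cudnnFindConvolutionForwardAlgorithm:

```cpp
// Time every available forward algorithm; results come back sorted fastest-first.
cudnnConvolutionFwdAlgoPerf_t perf[8];
int returned = 0;
CHECK_CUDNN(cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc,
                                                 convDesc, yDesc,
                                                 8, &returned, perf));
cudnnConvolutionFwdAlgo_t algo = perf[0].algo;

// Some algorithms need scratch space; providing a workspace often unlocks
// a faster path than the zero-workspace implicit GEMM.
size_t wsBytes = 0;
CHECK_CUDNN(cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc,
                                                    convDesc, yDesc, algo,
                                                    &wsBytes));
void* workspace = nullptr;
if (wsBytes > 0) cudaMalloc(&workspace, wsBytes);

CHECK_CUDNN(cudnnConvolutionForward(handle, &one, xDesc, x, wDesc, w,
                                    convDesc, algo, workspace, wsBytes,
                                    &zero, yDesc, y));
```

Note that cudnnFindConvolutionForwardAlgorithm actually executes and times each algorithm, so it belongs in one-time setup, not in the per-frame path.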
I’m assuming that the Tegra K1 GPU having no dedicated memory of its own (it shares system DRAM with the CPU) is a serious bottleneck for these kinds of operations.
The question is, how much of a performance boost would I get executing this same operation on a Tegra X1 or X2 (on the Jetson boards)? I’m assuming that FP16 support and the faster memory subsystem on the TX1/TX2 are going to significantly improve performance for these operations.
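For illustration, the FP16 variant would mostly be a descriptor change on the sketch above; this assumes a cuDNN build with half-precision support (the data buffers would hold __half values instead of float):

```cpp
// Same shapes as before, but stored as half precision.
CHECK_CUDNN(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC,
                                       CUDNN_DATA_HALF, 1, 3, 256, 256));
CHECK_CUDNN(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF,
                                       CUDNN_TENSOR_NCHW, 64, 3, 3, 3));
CHECK_CUDNN(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW,
                                       CUDNN_DATA_HALF, 1, 64, 256, 256));

// Compute type can stay CUDNN_DATA_FLOAT for accuracy ("pseudo-half"),
// or be CUDNN_DATA_HALF for true FP16 math where the hardware supports it.
CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                            CUDNN_CROSS_CORRELATION,
                                            CUDNN_DATA_HALF));
```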
I can’t tell you the exact performance improvement you will get, as I haven’t tried it on all platforms, but as we suggested in the other topic you posted, it’s recommended to use the TX1 or TX2 for deep learning use cases.