Hi, apparently NCHW is the preferred layout for data buffers in cuDNN. However, the framework I am using (which includes CPU-optimized routines) stores all of its data buffers in NHWC order. I do not want to lose the CPU-optimized code paths for scenarios where my users have no suitable GPU available.
Are there any significant performance penalties for NHWC that would make it worthwhile to convert to NCHW?
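For reference, the conversion itself is just an index permutation, so the question is really whether its cost (plus any cuDNN NHWC penalty) is worth it. A minimal CPU-side sketch of that permutation (the helper name `nhwc_to_nchw` is mine, not a cuDNN routine):

```c
#include <stddef.h>

/* Copy a float buffer from NHWC to NCHW layout.
   NHWC index: ((n*H + h)*W + w)*C + c
   NCHW index: ((n*C + c)*H + h)*W + w */
void nhwc_to_nchw(const float *src, float *dst,
                  size_t N, size_t C, size_t H, size_t W)
{
    for (size_t n = 0; n < N; ++n)
        for (size_t c = 0; c < C; ++c)
            for (size_t h = 0; h < H; ++h)
                for (size_t w = 0; w < W; ++w)
                    dst[((n*C + c)*H + h)*W + w] =
                        src[((n*H + h)*W + w)*C + c];
}
```

This is O(N*C*H*W) with poor cache locality on one side of the copy, so doing it per inference call would not be free.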
Are there performance differences between using 4d and Nd tensor descriptors?
Same question for 2d versus Nd convolution descriptors?
Please consider this question as closed. As a matter of fact, I have come to realize that support for NHWC is incomplete in cuDNN. Surprising and very disappointing, I might say, since support for this was already announced back in 2014 for v2: https://devtalk.nvidia.com/default/topic/783344/?comment=4664719