more information on maxwell_fp16_scudnn_winograd

Hi, nvprof shows maxwell_fp16_scudnn_winograd_fp16_128x128_mobile_tile148t_nt is the bottleneck on my TX1.
Where can I find more information about this kernel and how it might be optimized?
I am running GoogLeNet on the TX1 with CUDA 7 and cuDNN 5.

The name of the kernel suggests a matrix-matrix multiplication. Each library API call may map to any number of differently optimized kernels, and this is especially true for matrix-matrix multiplication. NVIDIA typically does not reveal the implementation of library API calls, and the implementations may well change between GPU platforms and CUDA versions. Typical kernel selection criteria include GPU architecture, matrix size, matrix aspect ratio, and transpose modes.
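As a sketch of the kind of selection involved (my own example against the cuDNN 5 API, not taken from your application): cuDNN can be asked which of its internally optimized forward convolution algorithms it would pick for a given configuration, and on some platforms that is where a Winograd kernel gets chosen. The layer shape below is a made-up placeholder and all error checking is omitted.

#include <cudnn.h>

void query_forward_algo(void)
{
    cudnnHandle_t handle;
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnConvolutionFwdAlgo_t algo;

    cudnnCreate(&handle);
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    /* hypothetical layer: N=1, C=192, H=W=28, 128 filters of 3x3, pad 1, stride 1 */
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, 1, 192, 28, 28);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 128, 192, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, 1, 128, 28, 28);

    /* ask cuDNN for the fastest forward algorithm for this configuration;
       the memory limit argument is ignored with PREFER_FASTEST */
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        0, &algo);
    /* algo now holds the chosen algorithm, e.g. CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
       on platforms where that path is selected */

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
}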

The name of the kernel suggests that it is highly optimized: the Winograd algorithm is used for low-complexity matrix multiplication, and NT mode is usually the fastest transpose mode combination. FP16 is used to maximize arithmetic throughput while minimizing global memory bandwidth requirements. The Maxwell version of the kernel is chosen in accordance with the GPU architecture of your platform.

As a general guideline, matrix multiplies tend to be fastest when (1) all matrices are (almost) square (2) all dimensions are multiples of 32 (3) matrices are large (4) no transpose on A but transpose on B; the already mentioned NT mode.
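To make that concrete, here is a minimal, illustrative SGEMM call set up along those lines (my own example, not a reconstruction of the kernel above; device allocations and error checking are omitted):

#include <cublas_v2.h>

void gemm_nt_example(cublasHandle_t handle,
                     const float *d_A, const float *d_B, float *d_C)
{
    /* square matrices, dimensions a multiple of 32 */
    const int m = 1024, n = 1024, k = 1024;
    const float alpha = 1.0f, beta = 0.0f;

    /* "NT" mode: no transpose on A, transpose on B, i.e. C = A * B^T */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, n, k,
                &alpha,
                d_A, m,   /* A stored m x k, lda = m */
                d_B, n,   /* B stored n x k, ldb = n */
                &beta,
                d_C, m);  /* C is m x n, ldc = m */
}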

You could always file an RFE (request for enhancement) with NVIDIA for a more optimized implementation, but chances are this kernel is already highly optimized and there is limited upside.

Thanks for your help. Is my understanding of #3 and #4 below correct?
#1 cublasSgemmEx: “IO in FP16/FP32, computation in float”
#2 cublasHgemm: IO and computation all in FP16 (a sketch contrasting #1 and #2 follows after this list)
#3 fp16_scudnn_winograd_fp16: IO and computation all in FP16 (scudnn does not mean FP32 computation as in cublasSgemmEx)
#4 fp16_hcudnn_winograd_fp16: hcudnn does not exist
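For reference, this is how the difference between #1 and #2 shows up at the API level (a sketch only; allocations and error checks are omitted, and the datatype enums are shown with their newer CUDA spelling, since the 7.5-era headers used different names):

#include <cublas_v2.h>
#include <cuda_fp16.h>

void mixed_vs_half_gemm(cublasHandle_t handle, int n,
                        const __half *dA16, const __half *dB16,
                        float *dC32, __half *dC16,
                        const __half *alphaH, const __half *betaH)
{
    /* #1 cublasSgemmEx: FP16 storage for A and B here, FP32 accumulation,
       alpha/beta passed as float */
    const float alphaF = 1.0f, betaF = 0.0f;
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                  &alphaF,
                  dA16, CUDA_R_16F, n,
                  dB16, CUDA_R_16F, n,
                  &betaF,
                  dC32, CUDA_R_32F, n);

    /* #2 cublasHgemm: FP16 storage and FP16 arithmetic throughout;
       alpha/beta are __half as well (prepared by the caller) */
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n,
                alphaH,
                dA16, n,
                dB16, n,
                betaH,
                dC16, n);
}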

I don’t know. There is only so much one can reasonably speculate based on just a function name. While *GEMM follows a reasonably well-established BLAS naming pattern, it is unclear what * means in *cudnn, since CUDNN is simply the name of a library, not of a particular function.

Let’s assume that, instead of speculating based on the name, we knew precisely what this kernel does. How would that help in optimizing your application’s performance?

BTW, are you using the latest available version of CUDNN?

Hi Njuffa, thanks for your reply.
“scudnn_winograd” consumes 27% of the total GoogLeNet inference time on my TX1. If its internal computation is FP32, I hope FP16 computation would cut its processing time by about 50%.
I am using the latest cuDNN 5 supported by the TX1’s CUDA 7. The cuDNN 5.1 RC is not used because it requires CUDA 7.5/8.

nvprof

Time(%)      Time     Calls       Avg       Min       Max  Name
 27.33%  31.1518s      4024  7.7415ms  1.4776ms  27.286ms  maxwell_fp16_scudnn_winograd_fp16_128x128_mobile_tile148t_nt
 19.67%  22.4155s      6539  3.4280ms  717.61us  6.3306ms  void cudnn::detail::pooling_fw_4d_kernel<__half, float, ...)

Speculating about possible speedups does not help, I think. For example, we don’t know whether this function is purely limited by FLOPS throughput. Assuming the kernel currently uses FP32 for internal computation, we don’t know whether switching it to FP16 would cause the code to fail, either due to a lack of accuracy in the results, or to intermediate overflow or underflow. We also don’t know whether the function has already been optimized further in CUDNN 5.1. As far as I know, CUDNN is closed source, meaning NVIDIA are the only ones who can do anything about its performance.
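As a rough upper bound (my own back-of-the-envelope arithmetic, not a measurement): your profile attributes 27.33% of GPU time to this kernel, so even if FP16 computation halved the kernel’s own runtime, total GPU time would shrink by only about 0.5 × 27.33% ≈ 13.7%.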

Therefore, I think one suitable course of action is to file an RFE against CUDNN for your particular use case, supplying a minimal repro case that demonstrates the performance you seek to improve, and pointing out the hot spots in the profiler results. NVIDIA can then take it from there. Either they’ll determine that further performance improvements are not practically feasible, or an improved version of this function will be targeted for a future version of CUDNN (note that a future release doesn’t necessarily mean the next release).

Hi Njuffa, thanks for your helpful reply. I will follow the RFE procedure you mentioned in the link below:
https://devtalk.nvidia.com/default/topic/822677/is-it-possible-to-memset-a-cudaarray-/?offset=2