Unexpectedly low performance of cuFFT with half-precision floating point (FP16)

With NVIDIA GPUs that offer full support for half-precision floating point (FP16), I was expecting a 2x processing-time performance boost with FP16 compared to single-precision floating point (FP32).

I have run repeatable benchmarks under controlled conditions on an NVIDIA Tesla P100 and a Jetson TX2. In both cases the 2x performance boost is only available for very small FFT sizes: smaller than 2^10 on the P100 and smaller than 2^13 on the Jetson TX2.

Full results, together with the source code of the benchmarks, are available in this public git repository:

https://bitbucket.org/ccicconetti/mbi_cuda_snippets/src/4d250095f5d7/BenchmarkFp16/?at=master
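For context, FP16 transforms in cuFFT go through the cufftXt extension API rather than the classic `cufftPlan*` calls. Below is a minimal sketch of what such a setup looks like (my own illustration, not code taken from the repository; the size and batch values are placeholders, and error checking is omitted for brevity):

```cuda
// Sketch: FP16 complex-to-complex cuFFT plan via the cufftXt API.
// Note: cuFFT's half-precision support requires power-of-two sizes.
#include <cufftXt.h>
#include <cuda_fp16.h>

int main() {
    const long long n = 1 << 14;   // FFT size (placeholder value)
    cufftHandle plan;
    cufftCreate(&plan);

    long long sizes[1] = { n };
    size_t workSize = 0;
    // Input, output, and execution types all set to half-precision complex.
    cufftXtMakePlanMany(plan, 1, sizes,
                        nullptr, 1, 1, CUDA_C_16F,   // input layout/type
                        nullptr, 1, 1, CUDA_C_16F,   // output layout/type
                        1, &workSize, CUDA_C_16F);   // batch, work size, exec type

    half2 *data;                                     // one half2 per complex sample
    cudaMalloc(&data, n * sizeof(half2));
    cufftXtExec(plan, data, data, CUFFT_FORWARD);    // in-place forward transform

    cudaFree(data);
    cufftDestroy(plan);
    return 0;
}
```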

A direct link to the P100 vs Jetson TX2 results is:

[External image: P100 vs Jetson TX2 benchmark results]

Has anybody stumbled upon similar results when experimenting with FP16?

To the best of my knowledge, large FFTs are always limited by memory throughput, not by compute throughput. While using a narrower data type should in theory result in higher "effective" memory throughput (measured in elements/second rather than bytes/second), narrower data types can also reduce the efficiency of memory accesses. So while I would not expect a 2x performance increase across the board, there should still be a meaningful incremental performance increase from using narrower data types.

Your graphs seem to be showing that this is not the case for large FFTs. This may indicate that the FFT code is not fully optimized for FP16 for large FFTs, either because these use cases have lesser importance in the market (*) or because there are technical difficulties (e.g. accuracy issues). I would suggest filing an RFE (request for enhancement) with NVIDIA, which you can do via the bug-reporting web form linked from the CUDA registered developer website. Simply prefix the synopsis with "RFE:" to mark it as an RFE rather than a functional bug.

(*) For any sufficiently large library, it is economically infeasible to fully optimize all possible variants of a particular functionality. Which variants get the most attention from software developers is typically prioritized based on market demand.