Reproducibility of results produced by the cuFFT library

Coming from the FFTW world, I have some concerns about the deterministic behavior of the cuFFT library and the reproducibility of its results.

When you use FFTW, you can get slightly different results (subject to round-off error) for the same input data and identical transforms. This happens because FFTW chooses the best plan at run time, and the plan may differ from run to run even on the same machine. To solve this problem, FFTW can precompute plans ("wisdom"), save them, and load them later. In that case, you get 100% reproducible results even on different machines.
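For reference, the FFTW wisdom mechanism described above looks roughly like this (a minimal sketch; the filename and transform size are arbitrary, and error handling is omitted):

```c
#include <fftw3.h>
#include <stdio.h>

int main(void)
{
    /* Try to reuse a previously saved plan ("wisdom"); returns 0 if absent. */
    if (!fftw_import_wisdom_from_filename("fft.wisdom"))
        fprintf(stderr, "no wisdom file found, planning from scratch\n");

    int n = 1024;
    fftw_complex *in  = fftw_alloc_complex(n);
    fftw_complex *out = fftw_alloc_complex(n);

    /* With wisdom loaded, FFTW_MEASURE reuses the saved plan instead of
       re-timing candidate algorithms, so the same plan (and hence the
       same bit pattern of results) is produced on every run. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill `in`, call fftw_execute(p), use `out` ... */

    /* Persist the plan for future runs. */
    fftw_export_wisdom_to_filename("fft.wisdom");

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```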

With cuFFT, there is no option to save plans. Therefore, I have two questions:

  1. Will the results of FFT transforms be identical across all supported devices if the cuFFT version is fixed?
  2. If the answer to the first question is "no", will the results be identical for the same compute architecture and a fixed cuFFT version?

I can’t speak for NVIDIA, but I think it is generally unlikely that the exact same code path is used on all supported devices for a given input configuration. It is more likely that at least some device-specific code paths exist to achieve optimal performance across the four architectures currently supported (Fermi, Kepler, Maxwell, Pascal). Even on CPUs there are often multiple code paths (e.g. x87 / SSE / AVX). So the likely answer to your first question is “no”, there is no such guarantee.

On the other hand, it seems likely that the answer to the second question is “yes”: For a specific combination of input configuration, GPU architecture, and library version the same plan is produced every time.
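One way to probe this empirically is to run the same transform twice (or on two machines with the same GPU architecture and cuFFT version) and compare the outputs bit for bit. A minimal sketch, assuming a 1D single-precision C2C transform; error checking is omitted for brevity:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

#define N 1024

/* Runs a forward 1D C2C FFT on fixed input and copies the result to host_out. */
static void run_fft(cufftComplex *host_out)
{
    cufftComplex h_in[N];
    for (int i = 0; i < N; ++i) { h_in[i].x = (float)i; h_in[i].y = 0.0f; }

    cufftComplex *d_data;
    cudaMalloc(&d_data, N * sizeof(cufftComplex));
    cudaMemcpy(d_data, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaMemcpy(host_out, d_data, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_data);
}

int main(void)
{
    cufftComplex a[N], b[N];
    run_fft(a);
    run_fft(b);
    /* Bit-wise comparison: on a fixed device and cuFFT version this is
       expected to match; across different architectures it may not. */
    printf("%s\n", memcmp(a, b, sizeof(a)) == 0 ? "identical" : "different");
    return 0;
}
```

Note that this only demonstrates run-to-run determinism on one machine; verifying the cross-machine claim requires repeating it on hardware of the same compute architecture with the same library version.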

Given that there can also be numerical differences between versions of the cuFFT library (and, more generally, of any mathematical library), I am wondering in which context the bit-wise reproducibility you are looking for is considered important.