cuFFT performance decrease with CUDA 7.0

Recently I upgraded my Linux box (with a GTX 750 Ti) from CUDA 6.5 to 7.0 and noticed a performance decrease in cuFFT.

For example, the simple program below reports an average runtime of 0.550 ms with 6.5 and 0.770 ms with 7.0.

Downgrading to 6.5 brings the average back to 0.550 ms; upgrading to 7.0 again yields 0.770 ms.

Here is the code:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>   // malloc/free
#include <sys/time.h>

#include <cuda.h>
#include <cufft.h>

#define PI 3.141592653589793
#define NFFT 128
#define BATCH 1000
#define NRUNS 1000

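// Wall-clock time in seconds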
inline double seconds()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.e-6;
}

int main(int argc, char* argv[])
{
    cudaSetDevice(0);

    int n = BATCH * NFFT;
    cufftComplex* array = (cufftComplex*)malloc(n*sizeof(cufftComplex));
    for (int i=0; i<n; i++)
    {
        array[i].x = cos(2.*PI*i/NFFT);
        array[i].y = sin(2.*PI*i/NFFT);
    }

    cufftComplex* d_array;
    cudaMalloc(&d_array, n*sizeof(cufftComplex));
    cudaMemcpy(d_array, array, n*sizeof(cufftComplex), cudaMemcpyHostToDevice);

    double runtime = 0.0;
    for (int i=0; i<NRUNS+1; i++)
    {
        double t0 = seconds();

        cufftHandle plan;
        int nfft[] = {NFFT};
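        // batched 1D C2C plan: BATCH transforms of length NFFT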
        cufftPlanMany(&plan, 1, nfft, NULL, 1, NFFT, NULL, 1, NFFT, CUFFT_C2C, BATCH);

        cufftComplex* d_fft;
        cudaMalloc(&d_fft, n*sizeof(cufftComplex));
        cufftExecC2C(plan, d_array, d_fft, CUFFT_FORWARD);

        cufftDestroy(plan);
        cudaFree(d_fft);

        if (i == 0) continue; // first run is just to warm up

        double t1 = seconds();
        runtime += t1-t0;
    }

    printf("Avg. runtime (s): %.6f\n", runtime/NRUNS);

    cudaFree(d_array);
    free(array);

    cudaDeviceReset();

    return 0;
}

Compiled with:
nvcc -arch=sm_50 -lcufft -o test test.cu

Supposedly, cuFFT should be faster in 7.0:
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
(see “cuFFT Performance Improvements”)

Am I doing something wrong? Or am I forgetting something?

I modified the code to time each function and here are the results:

+----------------+----------+----------+-------------+
| Runtimes (s)   | CUDA 6.5 | CUDA 7.0 | Difference  |
+----------------+----------+----------+-------------+
| cufftPlanMany  | 0.000366 | 0.000590 | significant |
| cudaMalloc     | 0.000061 | 0.000061 | negligible  |
| cufftExecC2C   | 0.000013 | 0.000013 | negligible  |
| cufftDestroy   | 0.000069 | 0.000068 | negligible  |
| cudaFree       | 0.000041 | 0.000037 | negligible  |
+----------------+----------+----------+-------------+
| Total          | 0.000550 | 0.000770 |             |
+----------------+----------+----------+-------------+
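
Roughly, the instrumented loop looked like this (a sketch using the same seconds() helper as above; note that cufftExecC2C is asynchronous, so its number mostly reflects launch overhead unless a cudaDeviceSynchronize() is added before reading the clock):

double t_plan = 0.0, t_malloc = 0.0, t_exec = 0.0, t_destroy = 0.0, t_free = 0.0;
for (int i = 0; i < NRUNS + 1; i++)
{
    double t0 = seconds();

    cufftHandle plan;
    int nfft[] = {NFFT};
    cufftPlanMany(&plan, 1, nfft, NULL, 1, NFFT, NULL, 1, NFFT, CUFFT_C2C, BATCH);
    double t1 = seconds();

    cufftComplex* d_fft;
    cudaMalloc(&d_fft, n*sizeof(cufftComplex));
    double t2 = seconds();

    cufftExecC2C(plan, d_array, d_fft, CUFFT_FORWARD);
    double t3 = seconds();

    cufftDestroy(plan);
    double t4 = seconds();

    cudaFree(d_fft);
    double t5 = seconds();

    if (i == 0) continue; // discard the warm-up iteration

    t_plan    += t1 - t0;
    t_malloc  += t2 - t1;
    t_exec    += t3 - t2;
    t_destroy += t4 - t3;
    t_free    += t5 - t4;
}

printf("plan: %.6f  malloc: %.6f  exec: %.6f  destroy: %.6f  free: %.6f\n",
       t_plan/NRUNS, t_malloc/NRUNS, t_exec/NRUNS, t_destroy/NRUNS, t_free/NRUNS);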

The issue can be narrowed down to cufftPlanMany, but why is plan generation slower in CUDA 7.0? Is it supposed to be?

“Please Note: In CUDA 7.0, the cuFFT library has a known issue that can lead to incorrect results for certain input sizes less than or equal to 1920 in any dimension when cufftSetStream() is passed a non-blocking stream (e.g., one created using the cudaStreamNonBlocking flag of the CUDA Runtime API or the CU_STREAM_NON_BLOCKING flag of the CUDA Driver API).”
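
For reference, the pattern that note describes looks like this (a minimal sketch; the 1024-point size is just a hypothetical example of a dimension less than or equal to 1920):

#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    cufftComplex *d_in, *d_out;
    cudaMalloc(&d_in,  1024 * sizeof(cufftComplex));
    cudaMalloc(&d_out, 1024 * sizeof(cufftComplex));

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); // non-blocking stream

    cufftHandle plan;
    cufftPlan1d(&plan, 1024, CUFFT_C2C, 1); // size <= 1920, i.e. in the affected range
    cufftSetStream(plan, stream);           // the combination the note warns about

    cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    cudaStreamSynchronize(stream);

    cufftDestroy(plan);
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}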

I don’t know whether this is related or not, but the current release, CUDA 7.0.28, is rather buggy; you might have to wait for the next bugfix release.

The first call into libcufft.so initializes the kernels inside the library. If you want to measure only plan generation, you should discard the time of the first iteration.
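
Alternatively, a throwaway call before the timed loop absorbs that one-time cost (a minimal sketch, reusing NFFT from the code above):

// Warm-up: the first cuFFT call pays the library's one-time initialization,
// so trigger it with a throwaway plan before timing anything.
cufftHandle warmup;
cufftPlan1d(&warmup, NFFT, CUFFT_C2C, 1);
cufftDestroy(warmup);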