Please help,
First of all, I apologize for re-posting this question from another section of the forums; I think it is better asked here. I will place my questions properly next time.
I have a performance question about cuFFT using a complex-to-complex forward FFT on a 1D array. There are no errors or unexpected data; this is purely a performance question.
The observed performance of the cuFFT forward FFT drops significantly when the array length is 2*2,097,157 (4,194,314), while array lengths 2*1,048,576 (2,097,152) and 2*4,194,304 (8,388,608) perform as expected. Is this an issue with FFTs in general, or an artifact of the algorithms cuFFT employs?
Example timings, where N is the array length passed to cuFFT (2*N in the code below):
N – DEVICE time (ms)
2,097,152 – 0.96
4,194,314 – 253.50
8,388,608 – 2.15
I am using a single GPU (Titan V) with compute architecture 10.1. A sample of the code I am using follows:
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
// I left off CUDA timing and error handling as I just want to
// know if there is something I am doing wrong with calling the
// cuFFT library
int main(){
// length of array - when N is 2097157 the performance is
// significantly worse than either 1048576 or 4194304
const unsigned int N = 2097157;
cuComplex *darray, *harray, *result;
harray = (cuComplex*)malloc(2*N*sizeof(cuComplex));
result = (cuComplex*)malloc(2*N*sizeof(cuComplex));
cudaMalloc((void**)&darray, 2*N*sizeof(cuComplex));
// initialize
for(unsigned int i = 0; i < 2*N; ++i){
harray[i].x = (float)i;
harray[i].y = 1.0f;
}
// copy to DEVICE
cudaMemcpy(darray, harray, 2*N*sizeof(cuComplex), cudaMemcpyHostToDevice);
// Didn't wrap these calls in error macros
cufftHandle plan;
cufftPlan1d(&plan, 2*N, CUFFT_C2C, 1);
cufftExecC2C(plan, darray, darray, CUFFT_FORWARD);
cufftDestroy(plan);
// copy to HOST
cudaMemcpy(result, darray, 2*N*sizeof(cuComplex), cudaMemcpyDeviceToHost);
free(harray);
free(result);
cudaFree(darray);
return 0;
}
Any ideas or hints as to why this behavior occurs would be great.
Thank you