CUDA FFT different from Matlab FFT

Hi!

I’m porting a Matlab application to CUDA. I need to calculate FFT by cuFFT library, but results between Matlab fft() and CUDA fft are different. I’ve seen around this forum ( http://forums.nvidia.com/index.php?showtop…mp;#entry589016 ) and others, that the problems resides in a different representation of matrix (row-major orderin CUDA vs column-major order in Matlab) and in a different representation of complex values (in CUDA we have a Array of Struct, while in Matlab we have a Struct of Array). The Matlab instruction that I’m trying to convert:

% A is a 1013*256-matrix

% B will be a 1024*256-matrix

B = fft(A, 1024, 1);

So I’ve made these trials:

  • n-point cudaFFT on A in 1-dimension
cufftHandle plan;

cudaPlan1d(&plan,1024*256,CUFFT_C2C,1);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);
  • cudaFFT on A in 2-dimension
cufftHandle plan;

cudaPlan2d(&plan,1024,256,CUFFT_C2C);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);
  • n-point cudaFFT on A^t (the transpose of A) in 1-dimension
cufftHandle plan;

cudaPlan1d(&plan,1024*256,CUFFT_C2C,1);

cufftExecC2C(plan,At,B,CUFFT_FORWARD);
  • cudaFFT on A in 2-dimension with args N and M swapped
cufftHandle plan;

cudaPlan2d(&plan,256,1024,CUFFT_C2C);

cufftExecC2C(plan,A,B,CUFFT_FORWARD);

None of these gave me results more or less equals to Matlab fft(). What can I do?

if you divide the GPU FFT results by the size of the FFT, the results should match between matlab & CUDA.

if you divide the GPU FFT results by the size of the FFT, the results should match between matlab & CUDA.

Ok, I will try. Thanks. ;)

Ok, I will try. Thanks. ;)

I’ve tried but without success. But which of the above transforms do I apply? And do I need to pass the transpose of A? FFT size is 1024*256, right?

I’ve tried but without success. But which of the above transforms do I apply? And do I need to pass the transpose of A? FFT size is 1024*256, right?

Look at the code in [url=“http://developer.download.nvidia.com/compute/cuda/1_1/Matlab_Cuda_1.1.tgz”]http://developer.download.nvidia.com/compu...ab_Cuda_1.1.tgz[/url].

You need to use the 2D transform, by the way.

Look at the code in [url=“http://developer.download.nvidia.com/compute/cuda/1_1/Matlab_Cuda_1.1.tgz”]http://developer.download.nvidia.com/compu...ab_Cuda_1.1.tgz[/url].

You need to use the 2D transform, by the way.

I’ve already seen that, but I don’t know how adjust it for my program. Recapitulating, the situation is the following:

  • I’ve defined some variables: Nblock = 1014, Ncell = 256

  • Initially, I launch a kernel that outputs an (Nblock-1)-by-Ncell matrix (truly, it’s a flattened cufftComplex array of (Nblock-1)*Ncell elements). This array, named snap_shot, has the same values as the one produced by Matlab (in Matlab it’s just an (Nblock-1)-by-Ncell matrix, not an array). So there’s no need to “pack” complex values as in Matlab_Cuda_1.1 plugin, right?

  • After kernel, I need to launch an FFT on this array. On Matlab the following instruction is used (figuring out a 1024-by-256 matrix):

temp_fft = fft(snap_shot, 2^nextpow2(Nblock), 1);   % 2^nextpow2(Nblock)==1024

Instead, on CUDA I’ve launched one of the following cuFFT:

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, Ncell*nxtPow2Nblock*sizeof(cufftComplex));

...

cufftHandle plan;

cufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

or

cufftHandle plan;

cufftPlan1d(&plan,Ncell*nxtPow2Nblock,CUFFT_C2C,1);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

All of these gives me different results compared with Matlab ones. As suggested here, I’ve also tried to divide FFT results by the size of FFT (which is nxtPow2Nblock*Ncell, right?); however, I always have different results from Matlab. I don’t know what to do.

I’ve already seen that, but I don’t know how adjust it for my program. Recapitulating, the situation is the following:

  • I’ve defined some variables: Nblock = 1014, Ncell = 256

  • Initially, I launch a kernel that outputs an (Nblock-1)-by-Ncell matrix (truly, it’s a flattened cufftComplex array of (Nblock-1)*Ncell elements). This array, named snap_shot, has the same values as the one produced by Matlab (in Matlab it’s just an (Nblock-1)-by-Ncell matrix, not an array). So there’s no need to “pack” complex values as in Matlab_Cuda_1.1 plugin, right?

  • After kernel, I need to launch an FFT on this array. On Matlab the following instruction is used (figuring out a 1024-by-256 matrix):

temp_fft = fft(snap_shot, 2^nextpow2(Nblock), 1);   % 2^nextpow2(Nblock)==1024

Instead, on CUDA I’ve launched one of the following cuFFT:

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, Ncell*nxtPow2Nblock*sizeof(cufftComplex));

...

cufftHandle plan;

cufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

or

cufftHandle plan;

cufftPlan1d(&plan,Ncell*nxtPow2Nblock,CUFFT_C2C,1);

...

cufftExecC2C(plan,snap_shot,temp_fft,CUFFT_FORWARD);

All of these gives me different results compared with Matlab ones. As suggested here, I’ve also tried to divide FFT results by the size of FFT (which is nxtPow2Nblock*Ncell, right?); however, I always have different results from Matlab. I don’t know what to do.

You definitely have to do a 2D FFT. When you say you have different results with Matlab what do you see?

for example: f2 = fftn(ref20); in Matlab,
CUDA equivalent:
cufftHandle plan_forward2;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward2, pix1, pix2, CUFFT_C2C)); //pix1 and pix2 are the size of the FFT in x and y dimension
CUFFT_SAFE_CALL(cufftExecC2C(plan_forward2, in2_d, f2_d, CUFFT_FORWARD));
CUFFT_SAFE_CALL(cufftDestroy(plan_forward2));

You definitely have to do a 2D FFT. When you say you have different results with Matlab what do you see?

for example: f2 = fftn(ref20); in Matlab,
CUDA equivalent:
cufftHandle plan_forward2;
CUFFT_SAFE_CALL(cufftPlan2d(&plan_forward2, pix1, pix2, CUFFT_C2C)); //pix1 and pix2 are the size of the FFT in x and y dimension
CUFFT_SAFE_CALL(cufftExecC2C(plan_forward2, in2_d, f2_d, CUFFT_FORWARD));
CUFFT_SAFE_CALL(cufftDestroy(plan_forward2));

I’ve already made that, but without success. What do you mean for “When you say you have different results with Matlab what do you see”? Simply, I see that the results between CUDA and Matlab are not equals (or at least they are not similar).

I’ve already made that, but without success. What do you mean for “When you say you have different results with Matlab what do you see”? Simply, I see that the results between CUDA and Matlab are not equals (or at least they are not similar).

But are the results equal considering that matlab is double precision and cuda single precision?

But are the results equal considering that matlab is double precision and cuda single precision?

I’ve also tried with cufftDoubleComplex type and with cufftExecZ2Z; it doesn’t seem a precision problem.

This is a snippet of my main():

int main(int argc, char *argv[]) {

...

float2 *dAmb;

cudaMalloc((void **) &dAmb, nxtPow2Nblock*Ncell*sizeof(float2));

cufftComplex *dsnap_shot;

cudaMalloc((void **) &dsnap_shot, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftHandle plan;

icufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

float2 *hAmb;

cudaMallocHost((void **) &hAmb, nxtPow2Nblock*Ncell*sizeof(float2));

...

ambfunc_kern<<<numBlocks,numThreads>>>(dhNc,dhNblock,Nblock,block_size,dsnap_shot);  // After this dsnap_shot will have correct values

cudaThreadSynchronize();

cufftExecC2C(plan,dsnap_shot,temp_fft,CUFFT_FORWARD);  // After this temp_fft has wrong values

cudaThreadSynchronize();

fftshift_kern<<<numBlocks,numThreads>>>(temp_fft,nxtPow2Nblock*Ncell,nxtPow2Nblock,block_size,dA

mb);	

cudaMemcpy(hAmb,dAmb,nxtPow2Nblock*Ncell*sizeof(float2),cudaMemcpyDeviceToHost);

...

}

I’ve also tried with cufftDoubleComplex type and with cufftExecZ2Z; it doesn’t seem a precision problem.

This is a snippet of my main():

int main(int argc, char *argv[]) {

...

float2 *dAmb;

cudaMalloc((void **) &dAmb, nxtPow2Nblock*Ncell*sizeof(float2));

cufftComplex *dsnap_shot;

cudaMalloc((void **) &dsnap_shot, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftComplex *temp_fft;

cudaMalloc((void **) &temp_fft, nxtPow2Nblock*Ncell*sizeof(cufftComplex));

cufftHandle plan;

icufftPlan2d(&plan,Ncell,nxtPow2Nblock,CUFFT_C2C);

float2 *hAmb;

cudaMallocHost((void **) &hAmb, nxtPow2Nblock*Ncell*sizeof(float2));

...

ambfunc_kern<<<numBlocks,numThreads>>>(dhNc,dhNblock,Nblock,block_size,dsnap_shot);  // After this dsnap_shot will have correct values

cudaThreadSynchronize();

cufftExecC2C(plan,dsnap_shot,temp_fft,CUFFT_FORWARD);  // After this temp_fft has wrong values

cudaThreadSynchronize();

fftshift_kern<<<numBlocks,numThreads>>>(temp_fft,nxtPow2Nblock*Ncell,nxtPow2Nblock,block_size,dA

mb);	

cudaMemcpy(hAmb,dAmb,nxtPow2Nblock*Ncell*sizeof(float2),cudaMemcpyDeviceToHost);

...

}

Any other suggestions?