Some conceptual questions on multidimensional cuFFT

Good evening, all.

I am extending some work I’m doing in FFT 1D to 2D and 3D, but came across some doubts in the API. The documentation at https://docs.nvidia.com/cuda/cufft/index.html left me unsure as to how to implement.
My basic declarations and definitions are (error checking omitted for easier reading):

// Declaring the cuFFT inputs, outputs and plans
cufftHandle fw_plan_2D, inv_plan_2D;
cufftComplex *complex_array;
cufftReal *input_array, *output_array;

// Explicitly saying in the variable names the directions
unsigned int x_rows, y_cols;

// Creating the forward and inverse plans
// Consider the x and y quantities already provided by user
cufftPlan2d(&fw_plan_2D, x_rows, y_cols, CUFFT_R2C);
cufftPlan2d(&inv_plan_2D, x_rows, y_cols, CUFFT_C2R);

// Allocate the 2D float and complex arrays as flattened to 1D
cudaMallocManaged(&input_array, sizeof(cufftReal) * x_rows * y_cols));
cudaMallocManaged(&output_array, sizeof(cufftReal) * x_rows * y_cols));
cudaMallocManaged(&complex_array, sizeof(cufftComplex) * x_rows * y_cols));

// Run the forward and inverse transforms with some work in between
// Consider the input array was already filled by another method
cufftExecR2C(fw_plan_2D, input_array, complex_array);
cudaThreadSynchronize();

// Do some work with the complex data

cufftExecC2R(inv_plan_2D, complex_array, input_array);
cudaThreadSynchronize();

// Do some more work with the complex data before deallocating everything

// Free the arrays
cudaFree(input_array);
cudaFree(output_array);
cudaFree(complex_array);

The code declares the arrays, plans, allocates memory, “runs” (in quotations marks because my questions are around here) the transforms and frees memory. So my questions are:

1 - My data is organized as “y_cols” (outter dimension) by “x_rows” (inner dimension). My data is organized as row-major. When we pass the arguments to cufftPlan2d, is the API considering the row-major nature of C/C++? Likewise, when we write the same in FORTRAN, the order of the arguments is the same but now treated as column-major, that is, “nx” is outter dimension and “ny” is inner?
2 - I must run “y_cols” transforms of length “x_rows”. However, I am not sure the way I call the transforms (lines 21 and 26) takes into consideration the parameters passed to plan creations in regards to “nx” and “ny”, due to question #1.
3 - Are the declaration and allocation of the complex array adequate for a 2D or 3D transform?
4 - Due to Hermitian symmetry, how should the stride be, as the complex array is flattened to 1D and the guides of the section 2.6 doesn’t mention this situation?

If you can assist with any of these questions or make any consideration, I will be grateful.

Yes, CUFFT assumes row-major data storage.

For your questions about R2C complex transforms, there are several questions on this forum that discuss this.

here is one:

[url]https://devtalk.nvidia.com/default/topic/826819/2d-cufft-wrong-result/[/url]

Thanks, txbob. This topic escaped my searches.
Your code probably answers these questions so I’m going to have a good look at it and adjust my work.

txbob, just a few question on the code of the referred topic:

  • The “fors” in lines 22 and 30, despite the indentation, are not inside the “if” in line 20, correct?
  • The logic in line 5 below, if sym_cols > j then cols is subtracted by j, is it the case of Hermitian symmetry? I thought it was be enough to just limit the iterations in j up to (cols / 2 + 1).
for (int i = 0; i < rows; i++)
    {
    for (int j = 0; j < cols; j++)
        if (j>=sym_cols)
            printf("%f ", data[i * sym_cols + (cols - j)].x);
        else
            printf("%f ", data[i * sym_cols + j].x);
    printf("\n");
    }

correct, that was sloppy of me. The indentation should be less confusing now.

The concept here is to give you printout that looks the same whether you specified R2C (in which case the actual data output is compressed taking advantage of hermitian symmetry) or C2C (in which case the data can be printed directly).

The underlying data is not 100% identical, but the point is to show that the equivalent of the C2C output can be generated from the (smaller) R2C output.

You could limit the iterations in j up to that limit. The printout would look different then. There is no right or wrong answer here, but I was trying to demonstrate that hermitian symmetry should not get in the way of using a R2C transform: you could create “exactly” the same output as you get from C2C, if you wished to.

Thank you, sir.
Now I have enough information to resume work.
Really appreciate your patience and guidance in these details.