Hello,
Today I ported my code to NVIDIA's cuFFT library via its FFTW interface API (include cufftw.h instead of fftw3.h, keep the same function-call names, etc.).
What I found is that it’s much slower than before:
- 30 Hz using CPU-based FFTW
- 1 Hz using GPU-based cuFFTW
I have already tried enabling all CPU cores at their maximum clocks.
The code flow is the same between the two variants. Could the speed difference be due to data alignment for SIMD?
- FFTW memory allocation was done with fftwf_alloc_complex (SIMD-aligned)
- cuFFTW memory allocation was done with regular C++ array initialization of a Complex datatype
Hi,
Could you lock the clock frequency to the maximum and try it again?
sudo ~/jetson_clocks.sh
Thanks.
Thanks for the suggestion.
I'm now up to 2 Hz, instead of 1.
However, I was at 30 before with the regular CPU-based FFTW.
I have reworked my code to use the native cuFFT function calls as follows:
// allocate device buffers for the real-to-complex transform
cudaMalloc((void**)&FFTout, sizeof(cufftComplex) * (BINSIZE/2 + 1));
cudaMalloc((void**)&FFTin, sizeof(cufftReal) * BINSIZE);
// plan a single 1D R2C transform
cufftPlan1d(&planFFT, BINSIZE, CUFFT_R2C, 1);
// copy input to device, execute, copy result back to host
cudaMemcpy(FFTin, data, BINSIZE * sizeof(cufftReal), cudaMemcpyHostToDevice);
cufftExecR2C(planFFT, FFTin, FFTout);
cudaMemcpy(&output, FFTout, (BINSIZE/2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
I am now observing 3 Hz total.
My best guess is that the slowdown comes from the extra memory reads/writes inherent in the usual GPU workflow.
However, since the TX2 shares physical memory between the CPU and GPU, is there a way to speed this up without the extra cudaMemcpy calls?
For reference, my FFT size is 512, looped across many sets.
Hi,
Have you checked our cuFFT sample?
Complex *h_signal = (Complex *)malloc(sizeof(Complex) * SIGNAL_SIZE);
You can find this example at ‘/usr/local/cuda-8.0/samples/7_CUDALibraries/simpleCUFFT/’.
Thanks.
I tracked down the issue to a combination of:
- the small FFT size, which doesn't parallelize well on cuFFT
- my initial approach of looping a 1D FFT plan
I got some performance gains by:
- switching cuFFT to batch mode, which reduced per-call initialization overhead
- allocating the host-side memory with cudaMallocHost, which page-locks (pins) the CPU-side memory and sped up transfers to GPU device space
As a result, I'm now up to 20 Hz using cuFFT, versus 30 Hz using CPU-based FFTW.
Still seeking methods of speeding things up.
Hi,
To give further suggestions, would you mind sharing the source of both the GPU- and CPU-based FFTW code?
Thanks.
Sizes:
int FFTSIZE = 512;
int BATCHSIZE = 4096;
int FFTSIZE_OUT = (FFTSIZE/2 + 1); //real-to-complex fft
CUDA:
//host side to hold R2C results
cudaMallocHost((void**)&FFT_data_host, FFTSIZE_OUT * BATCHSIZE * sizeof(Complex));
//device side
cudaMalloc((void**)&cudaFFTin, sizeof(cufftReal) * FFTSIZE * BATCHSIZE);
cudaMalloc((void**)&cudaFFTout, sizeof(cufftComplex) * FFTSIZE_OUT * BATCHSIZE);
//plan
cufftHandle planFFT;
cufftPlan1d(&planFFT, FFTSIZE, CUFFT_R2C, BATCHSIZE);
...
copy data into FFT_data_host
...
//copy to device
cudaMemcpy(cudaFFTin, &FFT_data_host[0], FFTSIZE * BATCHSIZE * sizeof(float), cudaMemcpyHostToDevice);
//execute
cufftExecR2C(planFFT, cudaFFTin, cudaFFTout);
//wait for sync
if (cudaDeviceSynchronize() != cudaSuccess){cout << "CUDA: RangeFFT failed to synchronize" << endl;}
//copy back to host
cudaMemcpy(&FFT_data_host[0], cudaFFTout, BATCHSIZE * FFTSIZE_OUT * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
FFTW:
//use the planmany interface to batch jobs together
int rank = 1;
const int n = FFTSIZE;
int howmany = BATCHSIZE;
int idist = FFTSIZE;
int odist = FFTSIZE/2+1;
int istride = 1;
int ostride = 1;
const int *inembed = NULL;
const int *onembed = NULL;
FFTW_plan = fftwf_plan_many_dft_r2c(rank, &n, howmany, FFTW_data_in, inembed, istride, idist, reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]), onembed, ostride, odist, FFTW_MEASURE);
...
copy data into FFTW_data_in
...
fftwf_execute_dft_r2c(FFTW_plan, reinterpret_cast<float*>(&FFTW_data_in[0]), reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]));
Is it possible to leverage CUDA unified memory to remove the need for those cudaMemcpy calls?
Say I initialize the buffers using cudaMallocManaged(): under the hood, would it still run at the same speed, or can the CPU and GPU share the same physical RAM without transfer overhead?
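For concreteness, here is an untested sketch of the managed-memory variant I have in mind, reusing the same names and sizes as above (whether this actually avoids copies on the TX2 depends on how the driver handles managed memory on Tegra):

```cpp
// Sketch (assumption, not measured): managed-memory variant of the
// batched pipeline above. cudaMallocManaged returns one pointer valid
// on both host and device; on Jetson's shared DRAM this can remove the
// explicit staging copies, though the driver may still migrate or
// synchronize pages behind the scenes.
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int FFTSIZE = 512, BATCHSIZE = 4096;
    const int FFTSIZE_OUT = FFTSIZE / 2 + 1;

    cufftReal*    FFTin  = nullptr;
    cufftComplex* FFTout = nullptr;
    cudaMallocManaged(&FFTin,  sizeof(cufftReal)    * FFTSIZE     * BATCHSIZE);
    cudaMallocManaged(&FFTout, sizeof(cufftComplex) * FFTSIZE_OUT * BATCHSIZE);

    cufftHandle planFFT;
    cufftPlan1d(&planFFT, FFTSIZE, CUFFT_R2C, BATCHSIZE);

    // CPU fills FFTin directly -- no host-to-device cudaMemcpy.
    for (int i = 0; i < FFTSIZE * BATCHSIZE; ++i) FFTin[i] = 0.0f;

    cufftExecR2C(planFFT, FFTin, FFTout);
    cudaDeviceSynchronize();   // required before the CPU touches FFTout

    // CPU reads FFTout directly -- no device-to-host cudaMemcpy.
    cufftDestroy(planFFT);
    cudaFree(FFTin);
    cudaFree(FFTout);
    return 0;
}
```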
Thanks
For sure. You can get more information here:
Programming Guide :: CUDA Toolkit Documentation