Using cuFFT callbacks for FFT windowing

Am interested in using cuFFT to implement overlapping 1024-pt FFTs on a 8192-pt input dataset and is windowed (e.g. hanning window). That is, the number of batches would be 8 with 0% overlap (or 12 with 50% overlap). I followed the example in the 2014 Pro-Tip (https://devblogs.nvidia.com/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/) and used a “load” callback to multiply the time-domain input by the windowing function. With 0% overlap, everything seems okay, but it doesn’t appear to be windowing quite right when FFTs are overlap (in_distance is less than Nfft). As part of debugging, I displayed the “offset” sent to the callback routine: while the offset values vary between 0 and 8191 (as they should), the number of offsets seems to be limited to 4096. That means that they are gaps in the offset values. Is 4096 some sort of hard limit?

There isn’t any hard limit around 4096.

Thanks for the response. Can you think of any reason why I don’t see all 8192 values show up when I put a simple printf(“%d\n”,offset) in the Load Callback function?

According to the cuFFT documentation, the load callback routine is called “once and only once” for each point in the input (which in this case is 8192-pts).

GPU-side printf has limitations. You may be running into a buffer size limit.

[url]Programming Guide :: CUDA Toolkit Documentation

Good hypothesis - and correct, I think. I limited the number of printfs on the GPU by conditioning it within an “if” statement, and it looks like all 8192 offsets are being sent to the callback. All good, but in debugging this - I don’t think the cuFFT callback method will work for windowing FFTs when the input stride is less than the length of the FFT (i.e. overlapping FFTs) since the callback is called only once for each point in the input. When the input stride is less than the length of the FFT, the callback routine would have to be called more than once. Consider a 50% overlap - for example, let’s say I have my 8192-pt input, my FFT is 1024-pts, and my input stride is 512-pts. The first FFT in the fftplan should be input points [0…1023], multiplied by the 1024-pt windowing function. However, the second FFT in the fftplan should be input points [512…1535] multiplied by the same 1024-pt windowing function. So as you can see, the windowed input for points 512 to 1023 are different, depending on which FFT in the “batch” you are on?

Am I missing something, and there is a way to get the “load callback” to be called for every fft in the fftplan? I am thinking I will need to implement the windowing (and FFTs) one at a time in a grid-stride loop, instead of the efficient MakePlanMany call.

After some thought I would agree that input callbacks for overlapping transforms won’t be readily possible, currently. The reason might be different than what you are stating. I’m not 100% convinced that only one load will be done per data element (I agree only one load will be done per data element per batch item), but I don’t want to argue about that, because there is a more fundamental issue, namely that the data provided to the callback routine (pointer to data, element index) is not sufficient to distinguish which batch ID/element is actually being requested, even if there were 1 load per data element for each item in the batch.

We’re looking at ways to improve CUFFT efficiency generally, and if you’re able to provide more details, we’re interested in your use case.

  1. anything you can share about motivation or science domain that has these overlapping FFTs? What is the problem being solved?
  2. Are the sizes of 8192-pt input, 1024-pt FFT, 512-pt stride relatively common or fixed for your use case? If not, what are typical ranges for these parameters?
  3. Are there other preprocessing steps right before this FFT, or postprocessing steps right after this FFT, that you care to discuss?
  4. Are you doing work on the GPU prior to this FFT? After this FFT?

thanks!

Sure - happy to answer (especially if CUFFT can be improved)!

Just doing digital signal processing…

  1. I almost always window and overlap the FFTs. Windowing is necessary to prevent artifacts from spreading power across the spectrum. This spectral leakage is inherent due to the fact that the DFT assumes signal infinite periodicity - which is not the case if the signal freq is not a perfect multiple of the frequency resolution. So the Windowing helps reduce the leakage - by introducing a tapering of the input signal over the N points - typically these windowing functions (e.g. Hanning) go to 0 at the boundaries. All good, right? Except for the fact that since the window goes to zero at the boundaries, if there is a signal there, you get an amplitude reduction in the resulting FFT: not good. Hence the overlap - with a 50% overlap, you are now centering an FFT window right over the spot where you were losing signal…so you are now guaranteed to capture any input signal without amplitude loss…regardless of the freuqency.
  2. The sizes I gave are just notional examples: I regularly use a wide variety of 2^N FFT lengths
  3. No other pre-processing steps (data is coming from an ADC). Lots of postprocessing after the FFTs, but not relevant right now.
  4. This is my first venture in GPUs. The plan is to do as much processing and data reduction on the GPU as possible - and then get it to the host.

I hope this helps.

Also, I should probably start a new thread, but since I have your attention:

There may be a bug in the cufftMakePlanMany call for CUFFT_C2C types, regarding the output distance parameter (odist). For CUFFT_R2C types, I can change odist and see a commensurate change in resulting workSize. However, for CUFFT_C2C, it seems that odist has no effect, and the effective odist corresponds to Nfft. It’s possible that I don’t quite understand how workSize works, but the behavior between R2C and CFC is definitely different.

Thanks.

I’m also in a desperate need for a cuFFT implementation of overlapping windowed FFTs. A typical example is an input of about 10000pts, a window of 3072pts, 4096pts FFT, with strides of 3.
One will then also need a zeropad of 1024pts before conversion, although it might be possible to solve that with longer windows with a zero tail.
So in total about 2300 4096pts FFTs per input sequence.

Currently I need to loop over the data to apply the window, which is slow and non-appealing solution. Speed-wise it’s about the same as the CPU case.

This thread has been quiet now for some time, so have there been any developments recently? Or, maybe, there is a solution that I’ve missed?

I am unaware of any development/update to make cuFFT work for overlapped and windowed FFTs. I had assumed that, unfortunately, nothing was done: I was forced to resort to an alternative method - which is/was not nearly as elegant and as simple as the load callback would have been…but at least it is functional. Sorry to be of no help here - wish the NVidia folks could have come through for me.

I’m also interested in having this feature supported. Is there any plan to have this in cuFFT?

Generally speaking, you won’t get answers to forward-looking questions on these forums. I’m not saying it never happens, but there are significant issues in making such statements, so in my experience its rare for NVIDIA to comment that way.

A good way to log your interest is by filing a bug using the instructions linked in a sticky post at the top of the CUDA programming sub-forum. If the development teams need any answers to clarifying questions they will use that vehicle to ask them.

In your bug report you can use the word “enhancement” which will help clarify it, and link back to this forum thread if you like.

The load callback can be used effectively to window data for overlapping DFTs. The trick is to configure CUDA FFT to do non-overlapping DFTs, and use the load callback to select the correct sample using the input buffer pointer and sample offset. For example, if you want to do 1024-pt DFTs on an 8192-pt data set with 50% overlap, you would configure as follows:

    int rank = 1;                                          // 1D FFTs
    int n[] = { 1024 };                                // Size of the Fourier transform
    int istride = 1, ostride = 1;                 // Distance between two successive input/output elements
    int idist = 1024, odist = 513;            // Distance between batches
    int inembed[] = { 0 };                          // (ignored for 1D transforms)
    int onembed[] = { 0 };                         // (ignored for 1D transforms)

    cufftPlanMany(  &handle, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, 15 );

cufftExecR2C() is called with an input buffer sized for 8192 real samples, and an output buffer sized for 7695 (15 * 513) complex results.

The magic is in the load callback. We are going to use it to map non-overlapped DFT samples to overlapped DFT samples. Because we configured for non-overlapping DFTs, the load callback gets called for every sample in every DFT, and the offset tells us where the sample is located in the input buffer. However, we will use the offset to determine the sample location in terms of batch number (offset/1024) and offset into the batch (offset%1024) so that we can find the overlapped DFT sample:

                         overlapped_sample = input_buffer[ (batch_number * 512) + batch_offset ]

With the batch_offset, we can apply the appropriate windowing factor.

device cufftReal
load_callback( void *dataIn, size_t offset, void *callerInfo, void *sharedPtr )
{
cufftReal *in_buffer= (cufftReal *)dataIn;

    int batch_number = offset / 1024;
    int batch_offset    = offset % 1024;
    int buffer_offset   = (batch_number * 512) + batch_offset;

    return ( in_buffer[buffer_offset] * hanning_window[batch_offset]  );

}

2 Likes

Another approach is to use cuFFTDx which allows for greater control of what happens with FFT by allowing to embed one in CUDA kernel. This allows to implement windowing directly.

https://docs.nvidia.com/cuda/cufftdx/index.html

1 Like

@david.brian.richards this was very helpful to what i am working on. I implemented a version of this for a project, but the resulting spectra does not look correct. I believe there are a few parameters in your example above that I may not be understanding. I will try to put them into a few questions:

  1. What did you mean by “Because we configured for non-overlapping DFTs” ? Is that because of the dist=1024 and odist=513 ? I am not sure how this is determined.
  2. When cufftExec stores the results from the first 0…1023 samples, then subsequently stores the results for the next 1024-2047 samples… i take it that odist=513 is simply forcing the store to overwrite the results from the first fft in the output buffer 513-1023 ?
  3. I assume cufftExec is not smart enough to know that once half way through batch 15 we exceed the input buffer extent of 8192 and to not call load_callback on that sample…? so in otherwords… i attempted this got a illegal mem access when)
  4. in 3 above, how did you handle overlapped_sample trying to access an out of bounds index into the input buffer? just return 0 from the callback or something else?

Given the following parameters, I think i have set this up correctly, but could use a sanity check if you have a moment:

  • input and output buffer sized for 3072000 complex samples
  • overlap 50% , 975238-pt DFTs
  • idist=975238
  • odist=487620
  • num_fft_batches=5
  int batch_number = offset / 975238;
    int batch_offset    = offset % 975238;
    int buffer_offset   = (batch_number * 487619) + batch_offset;

Thanks in advance!

Hi @bill.urrego,

The “idist” parameter determines overlapping. In this example, it is set to 1024 which is the DFT point size. You would set it to 512 if you wanted cuFFT to do a 50% overlap. In this example, the “odist” parameter is set to the output size of the DFT (513), so that the DFT results are stored sequentially and not overwritten in the output buffer (i.e., DFT1 in output buffer location 0-512, DFT2 in 513-1025, etc.).

In this example, the 15th DFT does not over-read the input buffer because 15th DFT lands on the last 1024 bytes of the input buffer.

Your project parameters looks correct to me. I assume that your “istride” and “ostride” are both set to one. Also, you must use separate input and output buffers, otherwise cuFFT will overwrite input data we need to do overlapping (one byte at a time).

I hope this helps.

1 Like

@david.brian.richards Yes! thank you very helpful. It didn’t click till reading your reply that idist == N results in no overlap.

Last question I have is on computing the required number of fft batches. I believe its just
(input buffer length / N) * (1.0 / 1- overlap) - 1
( 8192 / 1024) * ( 1.0 / (1-0.5) ) - 1

But i cant really explain the - 1 at the end… does that just discount the first batch on 0…1023 samples ?

Thanks again!

@bill.urrego, I’m reasonably sure the proper way to calculate the number of whole DFTs without padding or overflow is:

num_dfts = floor( (input_buf_size - N) / (N - overlap)) + 1 /* for input_buf_size >= N */

floor((8192-1024)/(1024-512) + 1 = 15
floor((3072000 - 975238)/(975238 - 487619)) + 1 = 5

I hope that helps.

1 Like

Thank you @david.brian.richards, yes that helps