my speedy FFT 3x faster than CUFFT

Hi, I’d like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than the 50 Gflop/s offered by CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes the n = 8 and n = 64 cases, which work in a special data layout.

Compile using CUDA 2.0 beta or later.

Vasily

Update (Sep 8, 2008): I attached a newer version that includes n = 16, n = 256 and n = 1024 cases. Here is its performance in Gflop/s when using driver 177.11 (up to ~40% slower with the newer 177.84 version):

   n    batch  8600GTS  8800GTX  9800GTX  GTX280
   8  2097152       22       67       52     102
  16  1048576       28       89       68     125
  64   262144       35      126      104     231
 256    65536       39      143      137     221
 512    32768       41      162      154     298
1024    16384       41      159      166     260

Update (Jan 12, 2009): I attached a quickly patched version that supports forward and inverse FFTs for n = 8, 16, 64, 256, 512, 1024, 2048, 4096 and 8192. Here are results for CUDA 2.1 beta, driver 180.60:

Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.

Compiled with CUDA 2.1 beta.

             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576     8.8   9.4    1.8     83.0   88.6    1.6     82.8   2.1
  16  524288    19.7  15.8    2.1     91.9   73.5    1.5     92.5   1.8
  64  131072    55.8  29.8    2.3    185.6   99.0    2.4    185.4   3.0
 256   32768    97.1  38.8    2.2    160.1   64.1    2.0    160.1   3.0
 512   16384    65.2  23.2    2.9    240.9   85.6    2.5    240.9   3.7
1024    8192    86.7  27.8    2.7    211.4   67.7    2.5    211.9   3.9
2048    4096    50.6  14.7    3.7    160.0   46.6    3.0    159.6   4.5
4096    2048    48.9  13.0    4.0    171.0   45.6    3.3    170.2   4.9
8192    1024    47.4  11.7    4.4    185.9   45.8    3.4    184.1   5.2

Errors are expected to be on the order of 1 ULP.

Some of the layouts used are not straightforward. If you need filtering based on element-wise multiplication by a vector in frequency space, I’d suggest storing the vector in the same layout; a sketch of such a filtering step follows the attachments below.
IFFT_011209.zip (19.5 KB)
FFT_090808.zip (11 KB)
FFT_061408.zip (7.43 KB)
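
To illustrate the filtering suggestion above, here is a minimal sketch of an element-wise multiply in frequency space. This is an example, not code from the attachments; it assumes the filter vector b has already been permuted into the same layout that the FFT output uses.

// Hypothetical frequency-space filter: multiply every transform in the
// batch element-wise by the same length-n vector b. Correct for any
// data layout, as long as b is stored in that same layout.
__global__ void pointwise_filter( float2 *a, const float2 *b, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    float2 x = a[i];
    float2 y = b[i % n];                  // reuse the filter across the batch
    a[i] = make_float2( x.x*y.x - x.y*y.y,            // complex multiply
                        x.x*y.y + x.y*y.x );
}

Launch it with n * batch total threads over the transformed data before running the inverse FFT.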

Very cool. Thanks for posting this. I’m excited to give it a try.

I noticed one thing that can be changed that might speed it up a little more. Change exp_i from this:

inline __device__ float2 exp_i( float phi )
{
    return make_float2( __cosf(phi), __sinf(phi) );
}

to this:

inline __device__ float2 exp_i( float phi )
{
    float sint, cost;
    __sincosf(phi, &sint, &cost);
    return make_float2( cost, sint );
}

I don’t know if it will help or not, but I think it might after noticing the loop inside twiddle. Thanks again.
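
For reference, here is roughly the shape of loop I mean; this is a guess at the pattern, not the code from the attachment (cmul, twiddle_sketch and the sign convention are my own):

inline __device__ float2 cmul( float2 a, float2 b )
{
    // complex multiplication (a.x + i*a.y) * (b.x + i*b.y)
    return make_float2( a.x*b.x - a.y*b.y, a.x*b.y + a.y*b.x );
}

inline __device__ void twiddle_sketch( float2 *a, int index, int n )
{
    // scale each register value by a root of unity; exp_i() is called
    // once per iteration, which is where __sincosf() could matter
    for( int j = 1; j < 8; j++ )
        a[j] = cmul( a[j], exp_i( -6.2831853f * index * j / n ) );
}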

Thanks for looking into the code! I tried your technique but didn’t get any difference. According to decuda, both choices produce nearly identical machine code. The only difference is that a couple of instructions were reordered and a few registers were renamed. It seems that __sincosf is compiled into the same instructions as the pair of __cosf and __sinf.

Vasily

Using sincos instead of separate sin and cos calls will be faster for the full-precision versions, though.

Vasily, I was surprised to hear it didn’t make a difference - I have some rotation code that got a nice benefit from using sincosf versus sinf and cosf. Thanks to both of you for the explanation. Turns out I use the full precision calls.
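
For reference, the full-precision analogue of exp_i would look like this (a sketch, not from the attachment, which uses the fast intrinsics):

inline __device__ float2 exp_i_precise( float phi )
{
    // one full-precision sincosf() shares the argument reduction
    // that separate sinf() and cosf() calls would each repeat
    float s, c;
    sincosf( phi, &s, &c );
    return make_float2( c, s );
}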

Hi,

I have a couple of questions; I’m kind of a beginner, so they might not be very interesting…

I’ve been working on DCTs recently, which is why I’m interested in your code.

__global__ void FFT512_device( float2 *work )
{
    // each thread block handles one 512-point transform of the batch
    work += (blockIdx.y * gridDim.x + blockIdx.x) * 512;

    int tid = threadIdx.x;
    int hi = tid>>3;
    int lo = tid&7;

    float2 a[8];                    // 64 threads x 8 values = 512 points
    __shared__ float smem[8*8*9];   // padded to avoid bank conflicts

    load( a, work + tid, 64 );

    FFT8( a );                      // first radix-8 stage, in registers

    twiddle( a, tid, 512 );
    tranpose( a, &smem[hi*8+lo], 66, &smem[lo*66+hi], 8 );

    FFT8( a );                      // second radix-8 stage

    twiddle( a, hi, 64 );
    tranpose( a, &smem[hi*8+lo], 8*9, &smem[hi*8*9+lo], 8, 0xE );

    FFT8( a );                      // third radix-8 stage

    store( a, work + tid, 64 );
}
  1. Why don’t you put the “float2 a[8]” array in shared memory? Is it because of the shared memory size limitation?

  2. Why don’t you synchronize threads before calling FFT8? Or after, or even between twiddle and transpose? I don’t really understand why it works without that.

I have tried to launch your code on CUDA 1.1; I replaced double with float, since doubles aren’t available in CUDA 1.1.

I get weird results:

Thanks.

oYo

Because shared memory is substantially slower than registers.

Synchronization is needed to coordinate access to shared memory. All shared memory accesses are done in transpose(), which performs all the synchronization that is required.
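
Schematically, the pattern is the following (a simplified sketch, not the actual code, which works on a float buffer and likely exchanges real and imaginary parts separately):

inline __device__ void transpose_sketch( float2 *a, float2 *smem,
                                         int wbase, int wstride,
                                         int rbase, int rstride )
{
    for( int i = 0; i < 8; i++ )
        smem[wbase + i*wstride] = a[i];   // publish registers
    __syncthreads();                      // writes visible to all threads
    for( int i = 0; i < 8; i++ )
        a[i] = smem[rbase + i*rstride];   // read back in transposed order
    __syncthreads();                      // smem is reusable after return
}

With both barriers inside, the caller never needs an explicit __syncthreads() between FFT8, twiddle and transpose.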

I was using CUDA 2.0 when writing this code. Later, I found that it runs much slower if compiled in CUDA 1.1. I could try to reoptimize the code for this earlier version of the compiler, but don’t really see the point.

If you still want to call it from an application code running in CUDA 1.1, you may compile this code into .cubin using CUDA 2.0 and then call it from CUDA 1.1 code using driver API.
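
For example, loading a pre-built cubin through the driver API might look like this (a sketch with made-up file and variable names; initialization, context creation and error checking are omitted, and it assumes the kernel was compiled with C linkage so its name is not mangled):

#include <cuda.h>

CUmodule module;
CUfunction kernel;
CUdeviceptr d_work;                              // device buffer, allocated elsewhere

cuModuleLoad( &module, "fft512.cubin" );         // cubin built with CUDA 2.0
cuModuleGetFunction( &kernel, module, "FFT512_device" );

cuFuncSetBlockShape( kernel, 64, 1, 1 );         // 64 threads per block
cuParamSetv( kernel, 0, &d_work, sizeof(d_work) );
cuParamSetSize( kernel, sizeof(d_work) );
cuLaunchGrid( kernel, 32768, 1 );                // one block per 512-point transform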

But a declaration like this:

float2 a[8];

will be placed in local memory, not in registers.

I don’t think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn’t spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains a similar declaration that does not spill in either case.

Ok!

Sorry for the second question, I knew you did not use shared mem except in transpose … T_T

One more thing, could you give me a typical result table? (I cannot install CUDA 2.0 at the moment.)

Thx again =)

Sure, here it is:

Device: GeForce 8800 Ultra, 1512 MHz clock, 768 MB memory.

             --------CUFFT-------  ---This prototype---

   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error

   8  524288    5.8    6.1   1.8     76.8   81.9   1.6

  64   65536   34.2   18.3   2.3    142.3   75.9   2.2

 512    8192   38.8   13.8   2.9    181.5   64.5   2.5

Errors are expected to be on the order of 1 ULP.

Maybe you have constant indexing then? Any array declared inside a global function is put in local memory unless all indexing is constant. (I looked briefly at your code yesterday evening, and I think you are indeed using constant indexing, because your loops are unrolled, so in your case it will indeed be put in registers.) This is regardless of whether registers spill to local memory.

As an example, compile a global function that calls sincosf(). You will see that local memory is used for that kernel, even when it uses fewer than 10 registers, just because an index cannot be determined at compile time. (The local memory is used in a code path that might not be triggered at runtime, so you may never see local loads & stores in the profiler output.)
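
A minimal pair of kernels illustrating the difference (my own example, not from the code under discussion):

// Constant indexing: after unrolling, every index into a[] is known at
// compile time, so a[] can be mapped onto registers.
__global__ void uses_registers( float *out )
{
    float a[4];
    #pragma unroll
    for( int i = 0; i < 4; i++ )
        a[i] = out[threadIdx.x] + i;
    out[threadIdx.x] = a[0] + a[1] + a[2] + a[3];
}

// Dynamic indexing: k is unknown at compile time, so the compiler has to
// place a[] in local memory, which is as slow as global memory.
__global__ void uses_local_memory( float *out, int k )
{
    float a[4];
    #pragma unroll
    for( int i = 0; i < 4; i++ )
        a[i] = out[threadIdx.x] + i;
    out[threadIdx.x] = a[k & 3];
}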

I didn’t know that you can use non-constant indexing for local variables. Thanks for pointing this out!

Well, you can, but it will be slow compared to shared memory ;)

I looked through the code and it seems like the complex part of the output is not produced.

Therefore, the comparison to CUFFT, which performs a true complex-to-complex FFT, is not fair, is it?

Can this explain the reported performance advantage?

Do the authors have a complex-to-complex version of the code?

best

This is an interesting discovery, as I thought that the complex part is produced. What made you think it is not? I’d appreciate a hint.

Thanks,

Vasily

Sorry, I was mistaken: I was looking at the shared memory allocation and expected it to be double-sized, but I realized the algorithm works somewhat differently.

I have another question: is it easy to modify the algorithm to produce an FFT of size 1024?

thanks a lot!

It is fairly easy to accomplish. I have produced such an algorithm by upgrading the n = 512 case up to radix-16. Of course, 16×16×16 = 4096 > 1024, so one of the stages (the middle one) has a lower radix of 4, i.e. it computes four independent radix-4 transforms instead, giving 16×4×16 = 1024. A higher radix requires more registers, but I did manage to fit into 40 registers, so that 3 thread blocks can run at a time per multiprocessor. It still fits into shared memory due to the particular shared memory allocation that you already found. I applied a fair effort to tune it but ultimately failed to get a higher Gflop/s rate than in the n = 512 case, so I didn’t post it here.
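
Schematically, the stage structure follows the same pattern as FFT512_device above (the names and thread count below are illustrative, not the actual code):

// n = 1024 = 16 * 4 * 16, each thread holding 16 values in registers:
//
//   load( a, ... );        // 64 threads x 16 values = 1024 points
//   FFT16( a );            // first radix-16 stage
//   twiddle + transpose    // data exchange through shared memory
//   4 x FFT4( a );         // middle stage: four independent radix-4
//   twiddle + transpose
//   FFT16( a );            // last radix-16 stage
//   store( a, ... );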

Vasily, I’ve read some of your papers, and noticed that some of your algorithm improvements have been included in CUBLAS 2.0; are you going to be adding the improvements from this work into the next release of CUFFT?

I heard that NVIDIA is working on it.