CUDA FFT lib and 1D transforms over 3D volume

I have a 3D data set of sinogram data (a series medical image projections viewed along a different dimension). Say my projection data is [x_dim y_dim z_dim] with sinogram data in the x_dim dimension then I need to perform y_dim * z_dim 1D ffts with N = x_dim on the 3D data set. The cufft library appears not to support this type of functionality with the exception of streaming and using cufftPlan1d().

After reading the Cuda FFT documentation and reviewing fftw’s documentation. It appears that for a 3d volume cufftPlan3d() and cufftPlanMany() performs a separable transform on each and every one of the dimensions. So for my purposes this cannot be used.

It also seems that none of these functions use cudaMemcpy[N]D where N is 2 or 3 i.e. 2D or 3D. cufftPlanMany performs separable transforms on N dimensions. cufftPlanMany() and cufftPlan3D perform seperable transforms along every dimension. This appears to necessitate having to perform 1D transforms over a 3D or 2D data set then rearrange the data into 2D and 3D pitched memory if further processing is needed on the memory where pitched memory (or stride) access would be desirable.

The requirement for padding, without disabling using cufftSetCompatibilityMode(), necessitates having to allocate memory, copy the data on padded boundaries and/or rearrange in GPU memory. This seems inefficient for large data sets. I.E. rearranging the data after copy or copying on boundaries. What is the best way to handle this? (1) On copy or (2) rearrange/realign in GPU?

Is my assessment of the CUDA FFT library correct?

if so:

Can 2D, 3D, Nd transforms be provided on 2D an 3D data sets which use pitched memory where the dimensions upon which to perform the transforms can be specified provided in future revisions of the fft lib?

It appears I need to do y * z linear non pitched/stride copies, create y * z streams perform fft, filter, and inverse fft, then put into pitched memory and continue on.

This library seems like square peg in round hole where pitched memory is concerned or can I use pitched memory?

To make effective use of streaming I need to create page-locked memory using cudaHostAlloc. There is no guarantee on the projection data that this is page-locked and likely is not as the pointer to this data is received by other software and would require a copy to page-locked host memory to guarantee streaming.

rock → me ← hard place

I’ll be interested to see any suggestions.
My dataset consists of 512 projections of size 512 wide by 480 high (fft’s are therefore only 512 long).
Though I do the CT backprojection using CUDA, so far I’ve been doing the fft filtering in Matlab.
It takes about 5 seconds on an Intel I7.
I’ve been meaning to do the whole job on the GPU, but as you pointed out, the CUDA library doesn’t look promising.

You could use the batching mode, since you are doing 480*512 transform of length 512.
Coupled with the streaming interface, it will be very effective.

mfatica:
Thanks. I’ll give that a look.

Please see my earlier post on streaming. This is only good for page-locked memory unless a block copy is done initially and then streamed ffts are performed correct? The goal here too is to use as little gpu memory and host memory as possible and avoid excessive copies and memory alignment issues. While I can use batching mode, I don’t believe I can use it with streaming memory and vise vera. Streaming mem copies seems out of the question, but then batch mode would then be possible.

So I think I should use block copy with batching and not use streaming memory copies… or streaming at all. Then discombobulate data in gpu into pitched memory and continue on.

I have Matlab-> C++ → GPU interface performing iradon filters (cosine, hann, hamming, shepp-logan and ram-lak to perform a cone-beam reconstruction using FDK approach. My projection data is 1240x960x133. I have currently tested the filters and they appear correct. I have not been seeing the same performance in Matlab with my data and my approach (3D version of iradon filtering). See the line “order = max(64,2^nextpow2(2*len));” in iradon (ie pad data by two for interpolation and take to next power of 2… ouch). MATLAB does not appear to take advantage of Hermitian symmetry of real data forward fft to complex in this function if I am understanding this correctly.

I’m also using a Matlab-> C++ → GPU interface but only doing the CT conebeam backprojection on the GPU. I don’t use any of Matlab’s CT routines.

The critical lines in my Matlab filtering are

P = padarray(P,[N/2 0 0]); % symmetrically zero pad to length of w = 2N

P = ifft( bsxfun(@times,fft(P),w), ‘symmetric’);

P = P(N/2+1:(3*N/2),:,:); % Trim zero-padding

where P is the 3D array of projections and w is my 1D filter function. N is the fft length, in my case 512.

I have the advantage of being able to pre-trim my projection ‘widths’ to 512 and still encompass the target. I don’t know if you can do the same to get down to 1024. Nevertheless, Matlab doesn’t require power-of-2 ffts.

The Intel I7 is a quad core with hyperthreading so it looks like 8 cores. The Matlab fft and ifft routines use all the cores.

Have you tried Jacket for this?

FFT - http://wiki.accelereyes.com/wiki/index.php/FFT

IFFT - http://wiki.accelereyes.com/wiki/index.php/IFFT

TIMES - http://wiki.accelereyes.com/wiki/index.php/TIMES

BSXFUN - http://wiki.accelereyes.com/wiki/index.php/BSXFUN

SUBSREF/SUBSASGN - http://wiki.accelereyes.com/wiki/index.php/SUBSREF

PADARRAY can be composed manually with two lines of regular MATLAB code, as follows:

P = gzeros( size of big matrix );

P( indexes where you want to put the small matrix ) = small matrix;

I’m guessing this would be super fast in Jacket, given other similar cases we’ve seen like this one.

For more information on how Jacket compares, see http://www.accelereyes.com/products/compare

Good luck and post on our forums if we can help further: http://forums.accelereyes.com

-John