Efficient memory copying with CUFFT's complex type

I’m doing some in-place real-to-complex and complex-to-real transforms. When I copy from a real data array into the array I will be FFT’ing (using a 1D kernel that stores fft_array[i].x = array1[2 * i]), I access every other element of the full array (array1), so the accesses aren’t contiguous, but I still find the kernel calls fairly fast.
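For concreteness, the first copy kernel probably looks something like this (a minimal sketch; the array names, the zeroing of the imaginary part, and the bounds check are my assumptions, not the poster's actual code):

```cuda
#include <cufft.h>

// Sketch of the first copy described above: each thread reads every
// other element of the full real array and stores it as the real part
// of a cufftComplex element. Stride-2 read, stride-1 write.
__global__ void gather_real(cufftComplex *fft_array,
                            const float *array1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fft_array[i].x = array1[2 * i]; // every other element of array1
        fft_array[i].y = 0.0f;          // assumption: imaginary part zeroed
    }
}
```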

However, the kernel that copies the real parts of the FFT array into a different (real) array (call it array2) is about 6-7x slower. This kernel does copies of the form array2[8 * i] = fft_array[i].x.
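The slower kernel would look roughly like this (again a sketch with assumed names; note the write side now has a stride of 8 elements rather than 2):

```cuda
#include <cufft.h>

// Sketch of the slower copy: stride-1 read of the complex array,
// stride-8 write into the real output array. Each 32-byte write
// transaction carries only one useful 4-byte float.
__global__ void scatter_real(float *array2,
                             const cufftComplex *fft_array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array2[8 * i] = fft_array[i].x; // widely strided write
}
```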

I’m unsure why this second kernel is so much slower, when both kernels transfer the same amount of data and neither access pattern is contiguous. Any general advice on optimizing such transfers would be greatly appreciated!

I am a bit confused about what is being copied where here, but if these are copies between different locations in GPU memory, I think you will find that the memory throughput of such copies decreases as the stride increases. You could easily create a microbenchmark for this by measuring throughput for strides of 1, 2, 4, 8, etc.
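Such a microbenchmark might look like the following sketch (my own construction, not from the thread; error checking omitted for brevity, grid/block sizes chosen arbitrarily):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Strided device-to-device copy: thread i moves one float at the
// given element stride. Running this for stride = 1, 2, 4, 8, ...
// shows how effective throughput drops as the stride grows.
__global__ void strided_copy(float *dst, const float *src,
                             int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i * stride] = src[i * stride];
}

int main(void)
{
    const int n = 1 << 20;       // floats moved per run
    const int max_stride = 16;
    float *src, *dst;
    cudaMalloc(&src, (size_t)n * max_stride * sizeof(float));
    cudaMalloc(&dst, (size_t)n * max_stride * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    for (int stride = 1; stride <= max_stride; stride *= 2) {
        cudaEventRecord(t0);
        strided_copy<<<(n + 255) / 256, 256>>>(dst, src, n, stride);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        // factor 2: n floats read plus n floats written
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("stride %2d: %7.2f GB/s\n", stride, gbps);
    }
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Timing a single launch with events as above is noisy; averaging over several repetitions per stride would give steadier numbers.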

As I recall from such a test I performed many years ago, the throughput ultimately falls to something like 1/16 of the maximum throughput achieved in the best possible (contiguous) case. The underlying reason is that the hardware is optimized for contiguous access comprising wide individual accesses.

Thanks, njuffa. In re: what is being copied - it is all in GPU memory. A field to be FFT’ed (multiple such fields live at each “site”) is copied to a single FFT array; after some manipulation of the output of the FFT, an inverse FFT is performed, whose output is stored into a different global array.

By “wide” individual accesses, do you mean something like each thread accessing a large group (O(10)? or?) of contiguous elements, where each thread’s group is adjacent to the groups of the neighboring threads? Or just that the global data pulled by all the threads of a block forms one large, contiguous access?

Second, are there any workarounds for when accesses are necessarily strided? (I could rearrange the arrays to avoid this, but then the other half of my program would be making strided accesses…)

By wide access I am referring to the physical accesses being made by the hardware, in particular the basic memory transaction size, which I think is 32 bytes in currently shipping GPUs (not sure). This means that as the stride increases, the effective utilization of each memory transaction goes down. E.g. when you touch a single 4-byte quantity in a 32-byte chunk, 7/8 of the available bandwidth is wasted.

So I think I don’t actually understand how CUDA performs global reads. I interpret what you say as: the GPU always pulls memory in fixed-size chunks, so the reason a kernel benefits from data that is contiguous in memory is that this minimizes the total number of transactions that need to be performed?

Secondly, does the order in which a kernel requests global data have an effect? That is, if I read two adjacent global memory locations at different points in the kernel, will the GPU read that chunk twice, or is it “smart” enough to fetch both variables with one memory access? Similarly, if each thread reads n (~8) adjacent variables one at a time (i.e., in a loop), with each thread’s set of n elements adjacent in memory to the next thread’s, will the entire read be contiguous? (Naively reading the code, one might think the loop is performing n strided memory calls, one per iteration.)
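To make the last question concrete, the loop pattern being asked about looks something like this sketch (hypothetical names; only the addressing is the point). Note that on iteration j, thread t of a warp touches src[t * n + j], so within any single iteration the warp’s addresses are n elements apart, which is what makes the “n strided memory calls” reading of the code plausible:

```cuda
// Pattern from the question: each thread reads n adjacent floats in a
// loop, and consecutive threads' groups are adjacent in memory.
// Per thread the addresses are contiguous, but per loop iteration the
// warp's addresses are n floats apart.
__global__ void per_thread_groups(float *dst, const float *src, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)    // n ~ 8 in the question
        sum += src[t * n + j];     // contiguous per thread...
    dst[t] = sum;                  // ...strided across the warp
}
```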