cuFFT + streams

I am doing multiple streams on FFT transform. What is wrong with my code?
It generates the wrong output. The FFT plan succeedes.
h_Data is set.

When I run this code, the display driver recovers, which, I guess, means something is being computed for really long.

static cufftComplex h_Data[sig*tds];
	cufftComplex *d_Data;

	const unsigned int bytesComplex = sizeof(cufftComplex) * sig * tds;

// asynchronous version
	// stream stuff
	const int blockSize = tds, nStreams = 4;
	const int nn = 320 * blockSize * nStreams;
	const int streamSize = nn / nStreams;
	const int streamBytes = streamSize * sizeof(cufftComplex);
	// create cuda streams
	cudaStream_t streams[nStreams];
	for (int i = 0; i < nStreams; i++)
	{
		check(cudaStreamCreate(&streams[i]));
	}
	// create cufft plans and set them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// GPU start time measurement
	check(cudaEventRecord(start));
	// CPU pinned memory allocation
	cufftComplex *h_DataPtr;
	h_DataPtr = h_Data;
	check(cudaMallocHost((void**)&h_DataPtr,bytesComplex));
	// GPU memory allocation
	check(cudaMalloc((void**)&d_Data,bytesComplex));
	// copy input data to GPU for processing
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice,streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],&d_Data[offset],&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		 cudaMemcpyAsync(&h_Data[offset], &d_Data[offset], 
                               streamBytes, cudaMemcpyDeviceToHost,
                               streams[i]) ;
	}
	// GPU stop time measurement
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&milliseconds, start, stop);
	std::cout<<"GPU timing by GPU (async): "<< milliseconds << " ms" <<endl;

here is a fully worked example:

[url]parallel processing - Asynchronous executions of CUDA memory copies and cuFFT - Stack Overflow

Ok guys. Since no article could help me solve my problem, I figured this out by myself.
The case is that I am using streamed cufftExecC2C function on (batch = 256 signals) with 1280 samples per each.
Every loop iterates on:

  • cudaMemcpyAsync
  • cufftPlanMany, cufftSet Stream
  • cufftExecC2C
// Creates cuFFT plans and sets them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// Async memcopyes and computations
	//check(cudaMemcpy(d_Data,h_Data,bytesComplex,cudaMemcpyHostToDevice));
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],(cufftComplex*)&d_Data[offset],(cufftComplex*)&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}

The case was to divide the BATCH number by the number of streams, i.e. 256/4 (at my example) at cufftPlanMany function.

I am leaving this thoughts for future generations.

Is it possible to assign memcpy (host to device) to stream1, cufftplan to stream2 and memcpy (device to host) to stream 3 so that all three of those things can overlap?

not cufft plan, but cufft execution, yes, it should be possible. cufft has the ability to set streams. The example code linked in comment 2 above demonstrates this.

Example code only show when you want to run 3 separate ffts. He uses a stream to overlap the copy, fft copy process of three separate processes right?

I guess I am curious if I can overlap memory copy with the actual fft computation? I am currently not getting correct values when I do so.

Also how do I associate a stream with cufftexec? I only see cufftsetstream which associates it with a specific plan.

You don’t associate a stream with cufftexec. You associate a stream with the plan (that you pass to cufftexec). If you’re not getting correct cufft results, you might be attempting to reuse a plan with different settings. That can be done, but may require you to manage plan-associated memory yourself. CUFFT provides mechanisms to do this.

Okay, that’s what I meant in my first question, my bad for not being clear.

So it is possible to overlap the reading of data with the execution of cufft?

I am worried it’s performing cufft when it doesn’t have all the needed data yet? How should synchronization work with the overlap of copy with cufft with the same data?