cuFFT + streams

dobislaw · October 19, 2014, 11:07am

I am running an FFT across multiple streams. What is wrong with my code?
It generates the wrong output, although the FFT plan creation succeeds.
h_Data is set.

When I run this code, the display driver recovers, which I guess means some computation is taking too long.

static cufftComplex h_Data[sig*tds];
	cufftComplex *d_Data;

	const unsigned int bytesComplex = sizeof(cufftComplex) * sig * tds;

// asynchronous version
	// stream stuff
	const int blockSize = tds, nStreams = 4;
	const int nn = 320 * blockSize * nStreams;
	const int streamSize = nn / nStreams;
	const int streamBytes = streamSize * sizeof(cufftComplex);
	// create cuda streams
	cudaStream_t streams[nStreams];
	for (int i = 0; i < nStreams; i++)
	{
		check(cudaStreamCreate(&streams[i]));
	}
	// create cufft plans and set them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// GPU start time measurement
	check(cudaEventRecord(start));
	// CPU pinned memory allocation
	cufftComplex *h_DataPtr;
	h_DataPtr = h_Data;
	check(cudaMallocHost((void**)&h_DataPtr,bytesComplex));
	// GPU memory allocation
	check(cudaMalloc((void**)&d_Data,bytesComplex));
	// copy input data to GPU for processing
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice,streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],&d_Data[offset],&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		 cudaMemcpyAsync(&h_Data[offset], &d_Data[offset], 
                               streamBytes, cudaMemcpyDeviceToHost,
                               streams[i]) ;
	}
	// GPU stop time measurement
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&milliseconds, start, stop);
	std::cout<<"GPU timing by GPU (async): "<< milliseconds << " ms" <<endl;

Robert_Crovella · October 19, 2014, 2:15pm

here is a fully worked example:

parallel processing - Asynchronous executions of CUDA memory copies and cuFFT - Stack Overflow

dobislaw · October 23, 2014, 8:17am

Ok guys. Since no article could help me solve my problem, I figured it out myself.
I am using a streamed cufftExecC2C on a batch of 256 signals with 1280 samples each.
The loops over the streams call:

cudaMemcpyAsync
cufftPlanMany, cufftSetStream
cufftExecC2C

// Creates cuFFT plans and sets them in streams
	cufftHandle* fftPlans = (cufftHandle*)malloc(sizeof(cufftHandle)*nStreams);
	for (int i = 0; i < nStreams; i++)
	{
		status = cufftPlanMany(&fftPlans[i],1,n,inembed,istride,idist,onembed,ostride,odist,CUFFT_C2C,batch/nStreams);
		if (status != CUFFT_SUCCESS)  
		{
			cout << "Cufft FFT plan error: " << status << endl;
		}
		cufftSetStream(fftPlans[i],streams[i]);
	}
	// Async memcopies and computations
	//check(cudaMemcpy(d_Data,h_Data,bytesComplex,cudaMemcpyHostToDevice));
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		cudaMemcpyAsync(&d_Data[offset], &h_Data[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
	}
	// FFT of a signal
	for (int i = 0; i < nStreams; ++i)
	{
		int offset = i * streamSize;
		status = cufftExecC2C(fftPlans[i],(cufftComplex*)&d_Data[offset],(cufftComplex*)&d_Data[offset],CUFFT_FORWARD);
		if (status != CUFFT_SUCCESS)
		{
			cout << "Cufft FFT work error: " << status << endl;
		}
	}

The fix was to divide the batch count by the number of streams, i.e. 256/4 in my example, in the cufftPlanMany call.
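To make the arithmetic concrete, here is a minimal host-only sketch of that split. The names `StreamChunk` and `partitionBatch` are made up for illustration and do not come from the code above; it also assumes the batch divides evenly by the number of streams.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// One contiguous slice of the batched signal buffer, handled by one stream.
// (Hypothetical helper type for this sketch.)
struct StreamChunk {
    int batch;           // number of signals this stream's plan transforms
    std::size_t offset;  // first element of the slice, in complex samples
    std::size_t count;   // slice length, in complex samples
};

// Split `totalBatch` signals of `fftSize` samples evenly across `nStreams`.
// Throws if the split is not even, since a plan built with batch/nStreams
// would otherwise silently skip the leftover signals.
std::vector<StreamChunk> partitionBatch(int totalBatch, int fftSize, int nStreams) {
    if (nStreams <= 0 || fftSize <= 0 || totalBatch <= 0 ||
        totalBatch % nStreams != 0) {
        throw std::invalid_argument("batch must divide evenly across streams");
    }
    const int perStream = totalBatch / nStreams;
    const std::size_t count = static_cast<std::size_t>(perStream) * fftSize;
    // Sanity check: the slices together cover the whole buffer.
    assert(count * nStreams == static_cast<std::size_t>(totalBatch) * fftSize);

    std::vector<StreamChunk> chunks;
    for (int i = 0; i < nStreams; ++i) {
        chunks.push_back({perStream, i * count, count});
    }
    return chunks;
}
```

With 256 signals of 1280 samples and 4 streams, `partitionBatch(256, 1280, 4)` gives each stream a batch of 64 and slices of 81920 samples starting at 0, 81920, 163840 and 245760. The `batch` value is what goes into the last argument of cufftPlanMany, and `offset` is the index used for `&d_Data[offset]`.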

I am leaving these thoughts for future generations.

yutong.zhang · May 17, 2018, 12:08am

Is it possible to assign the memcpy (host to device) to stream 1, the cuFFT plan to stream 2 and the memcpy (device to host) to stream 3, so that all three of those things can overlap?

Robert_Crovella · May 17, 2018, 2:13am

not cufft plan, but cufft execution, yes, it should be possible. cufft has the ability to set streams. The example code linked in comment 2 above demonstrates this.

yutong.zhang · May 17, 2018, 3:34pm

The example code only shows the case where you want to run 3 separate FFTs. He uses streams to overlap the copy, FFT, copy sequence of three separate processes, right?

I guess I am curious whether I can overlap the memory copy with the actual FFT computation. I am currently not getting correct values when I do so.

yutong.zhang · May 17, 2018, 3:36pm

Also, how do I associate a stream with cufftexec? I only see cufftSetStream, which associates a stream with a specific plan.

Robert_Crovella · May 18, 2018, 2:15pm

You don’t associate a stream with cufftexec. You associate a stream with the plan (that you pass to cufftexec). If you’re not getting correct cufft results, you might be attempting to reuse a plan with different settings. That can be done, but may require you to manage plan-associated memory yourself. CUFFT provides mechanisms to do this.
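As a rough sketch of the "manage plan-associated memory yourself" route, the following uses cufftCreate, cufftSetAutoAllocation, cufftMakePlan1d and cufftSetWorkArea to give a plan a caller-owned work area before binding it to a stream. The sizes (1280 samples, batch of 64) are borrowed from earlier in the thread, and the `check*` helpers are made up for this example. Treat it as an illustration under those assumptions, not as the only way to do it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helpers for this sketch: abort on the first error.
static void checkCuda(cudaError_t e) {
    if (e != cudaSuccess) { fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); }
}
static void checkFft(cufftResult r) {
    if (r != CUFFT_SUCCESS) { fprintf(stderr, "cuFFT error: %d\n", (int)r); exit(1); }
}

int main() {
    const int fftSize = 1280, batch = 64;  // illustrative sizes

    // Create an empty plan and tell cuFFT not to allocate its own work area.
    cufftHandle plan;
    size_t workSize = 0;
    checkFft(cufftCreate(&plan));
    checkFft(cufftSetAutoAllocation(plan, 0));
    checkFft(cufftMakePlan1d(plan, fftSize, CUFFT_C2C, batch, &workSize));

    // Supply a work area owned by the caller (skipped if none is needed).
    void* workArea = nullptr;
    if (workSize > 0) {
        checkCuda(cudaMalloc(&workArea, workSize));
        checkFft(cufftSetWorkArea(plan, workArea));
    }

    // The stream belongs to the plan, not to the exec call.
    cudaStream_t stream;
    checkCuda(cudaStreamCreate(&stream));
    checkFft(cufftSetStream(plan, stream));

    // Run one in-place batched transform on zeroed data in that stream.
    cufftComplex* d_data = nullptr;
    const size_t bytes = sizeof(cufftComplex) * fftSize * batch;
    checkCuda(cudaMalloc((void**)&d_data, bytes));
    checkCuda(cudaMemsetAsync(d_data, 0, bytes, stream));
    checkFft(cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD));
    checkCuda(cudaStreamSynchronize(stream));

    checkFft(cufftDestroy(plan));
    checkCuda(cudaStreamDestroy(stream));
    checkCuda(cudaFree(d_data));
    if (workArea) checkCuda(cudaFree(workArea));
    printf("done, work area was %zu bytes\n", workSize);
    return 0;
}
```

This should build with something like `nvcc sketch.cu -lcufft`. If the goal is only to run transforms concurrently in several streams, creating one plan per stream, as in the code earlier in the thread, is likely the simpler option.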

yutong.zhang · May 18, 2018, 3:28pm

Okay, that’s what I meant in my first question, my bad for not being clear.

So it is possible to overlap the reading of data with the execution of cufft?

I am worried that it is running cufft before it has all the needed data. How should synchronization work when the copy and the cufft execution overlap on the same data?
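On the synchronization question, a minimal sketch may help. Work issued to the same stream runs in issue order, so if the host-to-device copy, the FFT and the device-to-host copy for one slice all go into that slice's stream, the FFT cannot start before its own input has arrived. The overlap then happens between different slices in different streams, not within one slice. The sizes and the `check*` helpers below are assumptions for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helpers for this sketch: abort on the first error.
static void checkCuda(cudaError_t e) {
    if (e != cudaSuccess) { fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); }
}
static void checkFft(cufftResult r) {
    if (r != CUFFT_SUCCESS) { fprintf(stderr, "cuFFT error: %d\n", (int)r); exit(1); }
}

int main() {
    const int fftSize = 1280, totalBatch = 256, nStreams = 4;  // illustrative
    const int perStream = totalBatch / nStreams;
    const size_t chunk = (size_t)perStream * fftSize;          // samples per stream
    const size_t chunkBytes = chunk * sizeof(cufftComplex);

    // Pinned host memory is needed for the copies to be truly asynchronous.
    cufftComplex *h_data = nullptr, *d_data = nullptr;
    checkCuda(cudaMallocHost((void**)&h_data, chunkBytes * nStreams));
    checkCuda(cudaMalloc((void**)&d_data, chunkBytes * nStreams));
    for (size_t k = 0; k < chunk * nStreams; ++k) { h_data[k].x = 1.0f; h_data[k].y = 0.0f; }

    // One stream and one plan per slice; each plan is bound to its stream.
    cudaStream_t streams[nStreams];
    cufftHandle plans[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        checkCuda(cudaStreamCreate(&streams[i]));
        checkFft(cufftPlan1d(&plans[i], fftSize, CUFFT_C2C, perStream));
        checkFft(cufftSetStream(plans[i], streams[i]));
    }

    // Copy in, transform, copy out: all three for slice i go into stream i,
    // so they run in this order for that slice.
    for (int i = 0; i < nStreams; ++i) {
        cufftComplex* h = h_data + i * chunk;
        cufftComplex* d = d_data + i * chunk;
        checkCuda(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[i]));
        checkFft(cufftExecC2C(plans[i], d, d, CUFFT_FORWARD));
        checkCuda(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[i]));
    }

    // The host must wait before reading any results.
    checkCuda(cudaDeviceSynchronize());
    printf("bin 0 of first signal: (%f, %f)\n", h_data[0].x, h_data[0].y);

    for (int i = 0; i < nStreams; ++i) {
        checkFft(cufftDestroy(plans[i]));
        checkCuda(cudaStreamDestroy(streams[i]));
    }
    checkCuda(cudaFreeHost(h_data));
    checkCuda(cudaFree(d_data));
    return 0;
}
```

If the copy and the FFT for the same data are placed in different streams instead, nothing orders them automatically, and an explicit dependency (for example an event recorded after the copy and waited on with cudaStreamWaitEvent) would be needed before the FFT runs.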