Cannot force kernels to execute concurrently
Hi,

I have a GT 540M in my laptop. The concurrentKernels demo from the SDK works.
In my application I have a few CPU threads, each of which calls kernel functions.

Part of every thread code:
[code]CUDA_SAFE_CALL( cudaStreamCreate( &m_stream ) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&deviceProp, device) );

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_corr, corr_size * 3));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_img, img_size));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_pattern, 2*ss*sizeof(int)));

CUDA_SAFE_CALL( cudaMemcpyAsync( d_img, h_img, img_size, cudaMemcpyHostToDevice, m_stream) );
CUDA_SAFE_CALL( cudaMemcpyAsync( d_pattern, h_pattern, 2*ss*sizeof(int), cudaMemcpyHostToDevice, m_stream) );
CUDA_SAFE_CALL( cudaMemsetAsync( d_corr, 0, corr_size * 3, m_stream));

dim3 grid( 1, 1, 1);
dim3 threads(threadsNum, 1, 1);

CorrExtrGpu<<< grid, threads, 0, m_stream >>>(
d_img,
img.SizeX(),
img.SizeY(),
d_corr,
m_binNo,
m_off,
d_pattern,
(int)(2 * ss));
CUT_CHECK_ERROR("Kernel ExtrGpu execution failed");


CUDA_SAFE_CALL( cudaMemcpyAsync( h_corr, d_corr, corr_size * 3, cudaMemcpyDeviceToHost, m_stream) );

cudaStreamSynchronize(m_stream);

CUDA_SAFE_CALL( cudaFree(d_img) );
CUDA_SAFE_CALL( cudaFree(d_pattern) );
CUDA_SAFE_CALL( cudaFree(d_corr) );
CUDA_SAFE_CALL( cudaStreamDestroy( m_stream ) );[/code]
For testing purposes I put a dummy loop in the kernel, so a single kernel execution takes about 1.5 s.
And there is no concurrency: if I change [i]m_stream[/i] to [i]0[/i] in the calls, the total application run time is the same. During the tests I ran 8 CPU threads.
Why? Where is my mistake?

PS: The profiler says that kernel time is about 93% of the GPU execution time, so the memcpys are not a problem for me.
#1
Posted 04/24/2012 02:28 PM   
The number of CPU threads has nothing to do with concurrency. The CUDA runtime should partition the streams across threads automatically.
I have a similar problem in this thread http://forums.nvidia.com/index.php?showtopic=227992 and I haven't found a solution yet. Maybe you can help.

If you provide the whole code I could try it on my system as well.
Thank you in advance,

Apostolis
#2
Posted 04/24/2012 09:43 PM   
Hi,
In your code you cannot get any concurrency, since you're using one single stream. The point here is that CUDA detects dependencies according to streams, which means that for the CUDA runtime, whatever is enqueued in a stream implicitly depends on what has been enqueued in the same stream beforehand. Here, since you only use a single stream (be it 0 or any other), all your actions are enqueued in the same queue and processed one after the other. If you want some of those actions to overlap, you'll have to define as many queues (== streams) as needed, and manage the concurrency by hand.
GPUs with compute capability 2.x have the ability to overlap data copies and kernel runs, with (in general) 2 copy engines. This leads to the potential of overlapping up to 3 actions: copy H2D, kernel run and copy D2H. Therefore, using more than 3 streams per device is normally not necessary. However, you have to manage carefully both the buffering of your data transfers to allow for the overlap, and the enqueueing order to avoid false dependencies at the copy engine level.
See [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf"]this[/url] [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrency_Jan_2012.mp4"]webinar[/url] for more details.
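To make this concrete, here is a minimal two-stream sketch (the buffer names, kernel and sizes are placeholders, and the host buffers must be pinned for the async copies to actually overlap):[code]// Sketch only: h_buf, d_buf, chunk and myKernel are hypothetical names.
// Stream 1's kernel can overlap stream 0's D2H copy, and so on.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    CUDA_SAFE_CALL( cudaStreamCreate(&s[i]) );
for (int i = 0; i < 2; ++i) {
    CUDA_SAFE_CALL( cudaMemcpyAsync(d_buf[i], h_buf[i], chunk, cudaMemcpyHostToDevice, s[i]) );
    myKernel<<< grid, threads, 0, s[i] >>>(d_buf[i]);
    CUDA_SAFE_CALL( cudaMemcpyAsync(h_buf[i], d_buf[i], chunk, cudaMemcpyDeviceToHost, s[i]) );
}
for (int i = 0; i < 2; ++i)
    CUDA_SAFE_CALL( cudaStreamSynchronize(s[i]) );[/code]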

Now, as a side note, remember that the UVA feature available on devices of compute capability 2.0 onward allows you to directly use pointers to host memory within your kernels. If your kernels don't need to reuse the transferred data after the initial processing (no need to keep the data on the device), using this feature rather than manually managing the transfers leads to much (much) cleaner code, typically as effective as the best manually optimised transfer scheme you could come up with.
Just for the sake of testing this, try the following on your machine (you get maximum performance if you allocate your host memory with cudaHostAlloc and the cudaHostAllocMapped flag):[code]
CorrExtrGpu<<< grid, threads>>>( // do not allocate explicitly any device memory, just use the host pointers
h_img,
img.SizeX(),
img.SizeY(),
h_corr,
m_binNo,
m_off,
h_pattern,
(int)(2 * ss));
[/code]
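One caveat: passing the host pointers directly like this relies on UVA, which is only active in a 64-bit process. In a 32-bit process you would have to go through the mapped alias instead; a sketch, with hypothetical variable names and guessed types:[code]// Non-UVA fallback sketch: fetch the device-side aliases of the mapped
// host buffers (types are guesses) and pass those to the kernel instead.
unsigned char *d_img_a; int *d_corr_a, *d_pattern_a;
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_img_a, h_img, 0) );
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_corr_a, h_corr, 0) );
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_pattern_a, h_pattern, 0) );
CorrExtrGpu<<< grid, threads >>>(d_img_a, img.SizeX(), img.SizeY(),
                                 d_corr_a, m_binNo, m_off, d_pattern_a, (int)(2 * ss));[/code]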
#3
Posted 04/25/2012 07:01 AM   
OK, I read some of the CUDA Programming Guide.

I do create more than one stream: each CPU thread creates its own stream.
But it seems that by default every CPU thread has a different GPU context, so concurrency is not possible (correct me if I'm wrong).

How to force CPU threads to use the same GPU context?
#4
Posted 04/25/2012 09:11 AM   
[quote name='Gilles_C' date='25 April 2012 - 07:01 AM' timestamp='1335337317' post='1400699']
[...] If you want some of those actions to overlap, you'll have to define as many queues (== streams) as needed, and manage the concurrency by hand.
GPUs with compute capability 2.x have the ability to overlap data copies and kernel runs, with (in general) 2 copy engines. This leads to the potential of overlapping up to 3 actions: copy H2D, kernel run and copy D2H. Therefore, using more than 3 streams per device is normally not necessary. However, you have to manage carefully both the buffering of your data transfers to allow for the overlap, and the enqueueing order to avoid false dependencies at the copy engine level.
See [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf"]this[/url] [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrency_Jan_2012.mp4"]webinar[/url] for more details.

Now, as a side note, remember that the UVA feature available on devices of compute capability 2.0 onward allows you to directly use pointers to host memory within your kernels. If your kernels don't need to reuse the transferred data after the initial processing (no need to keep the data on the device), using this feature rather than manually managing the transfers leads to much (much) cleaner code, typically as effective as the best manually optimised transfer scheme you could come up with.
Just for the sake of testing this, try the following on your machine (you get maximum performance if you allocate your host memory with cudaHostAlloc and the cudaHostAllocMapped flag):[code]
CorrExtrGpu<<< grid, threads>>>( // do not allocate explicitly any device memory, just use the host pointers
h_img,
img.SizeX(),
img.SizeY(),
h_corr,
m_binNo,
m_off,
h_pattern,
(int)(2 * ss));
[/code]
[/quote]

Thanks for the reply. What I did is allocate [i]h_corr[/i] and [i]h_pattern[/i]:
[code]CUDA_SAFE_CALL( cudaHostAlloc( (void**)&h_corr, corr_size, cudaHostAllocMapped));
CUDA_SAFE_CALL( cudaHostAlloc( (void**)&h_pattern, pattern_size, cudaHostAllocMapped));[/code]
and register [i]h_img[/i] (it's allocated using [i]malloc()[/i] by another module, so I can't change how it's allocated):
[code]CUDA_SAFE_CALL( cudaHostRegister((void*)h_img, img_size, cudaHostRegisterMapped) );[/code]

The first lines in main are:
[code]cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);[/code]

And now I get [code]Cuda error in file 'ftrcorrgpu.cu' in line 213 : unknown error.[/code] which is [code]CUDA_SAFE_CALL( cudaHostUnregister((void*)h_img) );[/code] after the kernel call.

So close, yet so far...
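One thing I still want to check (just a guess on my part, not something the error text confirms): I've read that in CUDA 4.x [i]cudaHostRegister[/i] wants the address and size to be page-aligned, and a plain [i]malloc()[/i]'ed buffer usually isn't. A sketch of registering the enclosing page-aligned range instead:[code]// Guesswork sketch: register the page-aligned range enclosing h_img.
// Needs <stdint.h>; assumes a 4 KiB page (query sysconf(_SC_PAGESIZE) to be safe).
const uintptr_t page = 4096;
uintptr_t base = (uintptr_t)h_img & ~(page - 1);
size_t len = (((uintptr_t)h_img + img_size + page - 1) & ~(page - 1)) - base;
CUDA_SAFE_CALL( cudaHostRegister((void*)base, len, cudaHostRegisterMapped) );
/* ... kernel launch as before ... */
CUDA_SAFE_CALL( cudaHostUnregister((void*)base) ); // unregister the same base pointer[/code]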
#5
Posted 04/25/2012 03:23 PM   
OK, after a few hours of coding and testing I determined that concurrent kernels work when the kernel launches come from one CPU thread.
But when multiple CPU threads launch kernels, they are not concurrent. I copied a fragment of the [i]concurrentKernels[/i] SDK example as the thread code:
[code]__global__ void clock_block(clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;

    while (clock_offset < clock_count) {
        clock_offset = clock() - start_clock;
    }
}

void TestClass::testConcurrent(void)
{
    clock_t time_clocks = 1000 * m_clockRate;
    dim3 grid2(1, 1, 1);
    dim3 threads2(1, 1, 1);

    clock_block<<< grid2, threads2, 0, m_stream >>>(time_clocks);
    cudaStreamSynchronize(m_stream);
    return;
}[/code]

The profiler shows me that every thread's stream is in the same context (see attachment [attachment=25397:kernels.jpg]). But they are not concurrent (4 CPU threads, 1 s kernel work time, every CPU thread fires the kernel twice; total application run time = 8 s, so no concurrency at all).
deviceQuery shows me that my graphics card supports concurrent kernels:
[code]Device 0: "GeForce GT 540M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 2.1[/code]

Is there any way to launch concurrent kernels from separate CPU threads? I really need that, because I have some GPU and CPU computations that I want to run in parallel.
#6
Posted 04/26/2012 11:58 AM   
Async memcpys are asynchronous to the host (they don't block CPU computation). Why do you need to create threads manually?
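For example (a minimal sketch; [i]doCpuWork[/i], [i]stream[/i] and the buffers are placeholders, and [i]h_buf[/i] must be pinned for the copy to be truly asynchronous):[code]// The host thread returns from cudaMemcpyAsync immediately...
CUDA_SAFE_CALL( cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream) );
doCpuWork();  // ...so this CPU work runs while the copy is still in flight.
CUDA_SAFE_CALL( cudaStreamSynchronize(stream) );  // wait before touching the results[/code]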
#7
Posted 04/26/2012 03:33 PM   
[quote name='apostglen46' date='26 April 2012 - 03:33 PM' timestamp='1335454421' post='1401247']
Async memcpys are asynchronous to the host (they don't block CPU computation). Why do you need to create threads manually?
[/quote]

Because I want to use the full possible power of the CPU+GPU, so I want to put the GPU and all the CPU cores to work.
For me the easy solution was CPU threads having their own streams. But it seems that streams from different CPU threads don't want to work concurrently.
#8
Posted 04/27/2012 06:55 AM   
I would suggest putting all the streams in the same thread and proceeding with your CPU algorithm as usual. That will probably resolve any thread/stream problem.
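Something like this minimal sketch (reusing [i]clock_block[/i] and [i]time_clocks[/i] from your post above; the stream count is arbitrary):[code]// One host thread, several streams: kernels in different streams
// are free to run concurrently on CC 2.x hardware.
const int N = 4;
cudaStream_t streams[N];
for (int i = 0; i < N; ++i)
    CUDA_SAFE_CALL( cudaStreamCreate(&streams[i]) );
for (int i = 0; i < N; ++i)
    clock_block<<< 1, 1, 0, streams[i] >>>(time_clocks);
CUDA_SAFE_CALL( cudaDeviceSynchronize() );
for (int i = 0; i < N; ++i)
    CUDA_SAFE_CALL( cudaStreamDestroy(streams[i]) );[/code]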
#9
Posted 04/28/2012 02:08 PM   