Cannot force kernels to execute concurrently
Hi,

I have a GT 540M in my laptop. The concurrentKernels demo from the SDK works.
In my application I have a few CPU threads, each of which calls kernel functions.

Part of every thread code:
[code]CUDA_SAFE_CALL( cudaStreamCreate( &m_stream ) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&deviceProp, device) );

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_corr, corr_size * 3));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_img, img_size));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_pattern, 2*ss*sizeof(int)));

CUDA_SAFE_CALL( cudaMemcpyAsync( d_img, h_img, img_size, cudaMemcpyHostToDevice, m_stream) );
CUDA_SAFE_CALL( cudaMemcpyAsync( d_pattern, h_pattern, 2*ss*sizeof(int), cudaMemcpyHostToDevice, m_stream) );
CUDA_SAFE_CALL( cudaMemsetAsync( d_corr, 0, corr_size * 3, m_stream));

dim3 grid( 1, 1, 1);
dim3 threads(threadsNum, 1, 1);

CorrExtrGpu<<< grid, threads, 0, m_stream >>>(
d_img,
img.SizeX(),
img.SizeY(),
d_corr,
m_binNo,
m_off,
d_pattern,
(int)(2 * ss));
CUT_CHECK_ERROR("Kernel ExtrGpu execution failed");


CUDA_SAFE_CALL( cudaMemcpyAsync( h_corr, d_corr, corr_size * 3, cudaMemcpyDeviceToHost, m_stream) );

cudaStreamSynchronize(m_stream);

CUDA_SAFE_CALL( cudaFree(d_img) );
CUDA_SAFE_CALL( cudaFree(d_pattern) );
CUDA_SAFE_CALL( cudaFree(d_corr) );
CUDA_SAFE_CALL( cudaStreamDestroy( m_stream ) );[/code]
For testing purposes I put a dummy loop in the kernel, so a single kernel execution takes about 1.5 s.
And there is no concurrency: if I change [i]m_stream[/i] to [i]0[/i] in the calls, the total application run time is the same. During the tests I ran 8 CPU threads.
Why? Where is my mistake?

PS: The profiler says that kernel time is about 93% of the GPU execution time, so the memcpys are not a problem for me.
#1
Posted 04/24/2012 02:28 PM   
The number of CPU threads has nothing to do with concurrency. The CUDA runtime should partition the streams across threads automatically.
I have a similar problem in this thread http://forums.nvidia.com/index.php?showtopic=227992 and I haven't found a solution yet. Maybe you can help.

If you provide the whole code I could try it on my system as well.
Thank you in advance,

Apostolis
#2
Posted 04/24/2012 09:43 PM   
Hi,
In your code you cannot get any concurrency, since you're using one single stream. The point here is that CUDA detects dependencies according to streams, which means that for the CUDA runtime, whatever is enqueued in a stream implicitly depends on what has been enqueued in the same stream beforehand. Here, since you only use a single stream (be it 0 or any other), all your actions are enqueued in the same queue and processed one after the other. If you want some of those actions to overlap, you'll have to define as many queues (== streams) as needed, and manage the concurrency by hand.
GPUs with compute capability 2.x have the ability to overlap data copies and kernel runs, with (in general) 2 copy engines. This leads to the potential of overlapping up to 3 actions: copy H2D, kernel run and copy D2H. Therefore, using more than 3 streams per device is normally not necessary. However, you have to manage carefully both the buffering of your data transfers to allow for the overlap, and the enqueueing order to avoid false dependencies at the copy engine level.
See [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf"]this[/url] [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrency_Jan_2012.mp4"]webinar[/url] for more details.
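To make this concrete, here is a minimal two-stream sketch (the buffer names, kernel and sizes are placeholders, and the host buffers must be pinned for the async copies to actually overlap):[code]// Sketch only: h_buf, d_buf, chunk and myKernel are hypothetical names.
// Stream 1's kernel can overlap stream 0's D2H copy, and so on.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    CUDA_SAFE_CALL( cudaStreamCreate(&s[i]) );
for (int i = 0; i < 2; ++i) {
    CUDA_SAFE_CALL( cudaMemcpyAsync(d_buf[i], h_buf[i], chunk, cudaMemcpyHostToDevice, s[i]) );
    myKernel<<< grid, threads, 0, s[i] >>>(d_buf[i]);
    CUDA_SAFE_CALL( cudaMemcpyAsync(h_buf[i], d_buf[i], chunk, cudaMemcpyDeviceToHost, s[i]) );
}
for (int i = 0; i < 2; ++i)
    CUDA_SAFE_CALL( cudaStreamSynchronize(s[i]) );[/code]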

Now, as a side note, remember that the UVA feature available on devices of compute capability 2.0 onward allows you to directly use pointers to host memory within your kernels. If your kernels don't need to reuse the transferred data after the initial processing (no need to keep the data on the device), using this feature rather than manually managing the transfers leads to much (much) cleaner code, typically as effective as the best manually optimised transfer scheme you could come up with.
Just for the sake of testing this, try the following on your machine (you get maximum performance if you allocate your host memory with cudaHostAlloc and the cudaHostAllocMapped flag):[code]
CorrExtrGpu<<< grid, threads>>>( // do not allocate explicitly any device memory, just use the host pointers
h_img,
img.SizeX(),
img.SizeY(),
h_corr,
m_binNo,
m_off,
h_pattern,
(int)(2 * ss));
[/code]
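One caveat: passing the host pointers directly like this relies on UVA, which is only active in a 64-bit process. In a 32-bit process you would have to go through the mapped alias instead; a sketch, with hypothetical variable names and guessed types:[code]// Non-UVA fallback sketch: fetch the device-side aliases of the mapped
// host buffers (types are guesses) and pass those to the kernel instead.
unsigned char *d_img_a; int *d_corr_a, *d_pattern_a;
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_img_a, h_img, 0) );
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_corr_a, h_corr, 0) );
CUDA_SAFE_CALL( cudaHostGetDevicePointer((void**)&d_pattern_a, h_pattern, 0) );
CorrExtrGpu<<< grid, threads >>>(d_img_a, img.SizeX(), img.SizeY(),
                                 d_corr_a, m_binNo, m_off, d_pattern_a, (int)(2 * ss));[/code]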
#3
Posted 04/25/2012 07:01 AM   
OK, I read some of the CUDA Programming Guide.

I do create more than one stream: each CPU thread creates its own stream.
But it seems that by default every CPU thread has a different GPU context, so concurrency is not possible (correct me if I'm wrong).

How to force CPU threads to use the same GPU context?
#4
Posted 04/25/2012 09:11 AM   
[quote name='Gilles_C' date='25 April 2012 - 07:01 AM' timestamp='1335337317' post='1400699']
[...] If you want some of those actions to overlap, you'll have to define as many queues (== streams) as needed, and manage the concurrency by hand.
GPUs with compute capability 2.x have the ability to overlap data copies and kernel runs, with (in general) 2 copy engines. This leads to the potential of overlapping up to 3 actions: copy H2D, kernel run and copy D2H. Therefore, using more than 3 streams per device is normally not necessary. However, you have to manage carefully both the buffering of your data transfers to allow for the overlap, and the enqueueing order to avoid false dependencies at the copy engine level.
See [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf"]this[/url] [url="http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrency_Jan_2012.mp4"]webinar[/url] for more details.

Now, as a side note, remember that the UVA feature available on devices of compute capability 2.0 onward allows you to directly use pointers to host memory within your kernels. If your kernels don't need to reuse the transferred data after the initial processing (no need to keep the data on the device), using this feature rather than manually managing the transfers leads to much (much) cleaner code, typically as effective as the best manually optimised transfer scheme you could come up with.
Just for the sake of testing this, try the following on your machine (you get maximum performance if you allocate your host memory with cudaHostAlloc and the cudaHostAllocMapped flag):[code]
CorrExtrGpu<<< grid, threads>>>( // do not allocate explicitly any device memory, just use the host pointers
h_img,
img.SizeX(),
img.SizeY(),
h_corr,
m_binNo,
m_off,
h_pattern,
(int)(2 * ss));
[/code]
[/quote]

Thanks for the reply. What I did is allocate [i]h_corr[/i] and [i]h_pattern[/i]:
[code]CUDA_SAFE_CALL( cudaHostAlloc( (void**)&h_corr, corr_size, cudaHostAllocMapped));
CUDA_SAFE_CALL( cudaHostAlloc( (void**)&h_pattern, pattern_size, cudaHostAllocMapped));[/code]
and register [i]h_img[/i] (it's allocated using [i]malloc()[/i] by another module, so I can't change how it's allocated):
[code]CUDA_SAFE_CALL( cudaHostRegister((void*)h_img, img_size, cudaHostRegisterMapped) );[/code]

The first lines in main are:
[code]cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);[/code]

And now I get [code]Cuda error in file 'ftrcorrgpu.cu' in line 213 : unknown error.[/code] which is [code]CUDA_SAFE_CALL( cudaHostUnregister((void*)h_img) );[/code] after the kernel call.

So close, yet so far...
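One thing I still want to check (just a guess on my part, not something the error text confirms): I've read that in CUDA 4.x [i]cudaHostRegister[/i] wants the address and size to be page-aligned, and a plain [i]malloc()[/i]'ed buffer usually isn't. A sketch of registering the enclosing page-aligned range instead:[code]// Guesswork sketch: register the page-aligned range enclosing h_img.
// Needs <stdint.h>; assumes a 4 KiB page (query sysconf(_SC_PAGESIZE) to be safe).
const uintptr_t page = 4096;
uintptr_t base = (uintptr_t)h_img & ~(page - 1);
size_t len = (((uintptr_t)h_img + img_size + page - 1) & ~(page - 1)) - base;
CUDA_SAFE_CALL( cudaHostRegister((void*)base, len, cudaHostRegisterMapped) );
/* ... kernel launch as before ... */
CUDA_SAFE_CALL( cudaHostUnregister((void*)base) ); // unregister the same base pointer[/code]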
#5
Posted 04/25/2012 03:23 PM   
OK, after a few hours of coding and testing I determined that concurrent kernels work when the kernel launches come from one CPU thread.
But when multiple CPU threads launch kernels, they are not concurrent. I copied a fragment of the [i]concurrentKernels[/i] SDK example as the thread code:
[code]__global__ void clock_block(clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;

    while (clock_offset < clock_count) {
        clock_offset = clock() - start_clock;
    }
}

void TestClass::testConcurrent(void)
{
    clock_t time_clocks = 1000 * m_clockRate;
    dim3 grid2(1, 1, 1);
    dim3 threads2(1, 1, 1);

    clock_block<<< grid2, threads2, 0, m_stream >>>(time_clocks);
    cudaStreamSynchronize(m_stream);
    return;
}[/code]

The profiler shows me that every thread's stream is in the same context (see attachment [attachment=25397:kernels.jpg]). But they are not concurrent (4 CPU threads, 1 s kernel work time, every CPU thread fires the kernel twice; total application run time = 8 s, so no concurrency at all).
deviceQuery shows me that my graphics card supports concurrent kernels:
[code]Device 0: "GeForce GT 540M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 2.1[/code]

Is there any way to launch concurrent kernels from separate CPU threads? I really need that, because I have some GPU and CPU computations that I want to run in parallel.
#6
Posted 04/26/2012 11:58 AM   
Async memcpys are asynchronous to the host (they don't block CPU computation). Why do you need to create threads manually?
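For example (a minimal sketch; [i]doCpuWork[/i], [i]stream[/i] and the buffers are placeholders, and [i]h_buf[/i] must be pinned for the copy to be truly asynchronous):[code]// The host thread returns from cudaMemcpyAsync immediately...
CUDA_SAFE_CALL( cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream) );
doCpuWork();  // ...so this CPU work runs while the copy is still in flight.
CUDA_SAFE_CALL( cudaStreamSynchronize(stream) );  // wait before touching the results[/code]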
#7
Posted 04/26/2012 03:33 PM   
[quote name='apostglen46' date='26 April 2012 - 03:33 PM' timestamp='1335454421' post='1401247']
Async memcpys are asynchronous to the host (they don't block CPU computation). Why do you need to create threads manually?
[/quote]

Because I want to use the full possible power of the CPU+GPU, so I want to put the GPU and all the CPU cores to work.
For me the easy solution was CPU threads having their own streams. But it seems that streams from different CPU threads don't want to work concurrently.
#8
Posted 04/27/2012 06:55 AM   
I would suggest putting all the streams in the same thread and proceeding with your CPU algorithm as usual. That will probably resolve any thread/stream problem.
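Something like this minimal sketch (reusing [i]clock_block[/i] and [i]time_clocks[/i] from your post above; the stream count is arbitrary):[code]// One host thread, several streams: kernels in different streams
// are free to run concurrently on CC 2.x hardware.
const int N = 4;
cudaStream_t streams[N];
for (int i = 0; i < N; ++i)
    CUDA_SAFE_CALL( cudaStreamCreate(&streams[i]) );
for (int i = 0; i < N; ++i)
    clock_block<<< 1, 1, 0, streams[i] >>>(time_clocks);
CUDA_SAFE_CALL( cudaDeviceSynchronize() );
for (int i = 0; i < N; ++i)
    CUDA_SAFE_CALL( cudaStreamDestroy(streams[i]) );[/code]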
#9
Posted 04/28/2012 02:08 PM   