Support for multi-threaded apps and multiple applications on CUDA
Hi,

I have a few questions about the capabilities of CudaRuntime for supporting multiple contexts.

Problem 1:
Let's say I have a multi-threaded application where each thread is a pthread. For simplicity, let's assume there are no dependencies between the threads; in other words, each thread is independent of the others. Inside each thread's function, I perform some CUDA operations. Note that I am trying to run all the threads on the same GPU.
In this scenario I observe the following:

1) When these threads are started at the same time (without any delay between creation of threads), they run to completion successfully.
2) When I insert a sleep between the creation of threads, some threads finish successfully while others hit errors during launch. This is consistently reproducible with 4 threads.

So, the question is: does the CUDA runtime guarantee correct support for multi-threaded programs? In other words, does the CUDA runtime handle per-thread contexts within a program correctly?

Problem 2:
Let's say I am trying to run multiple applications on a single GPU at the same time; again, there might be some start delays between the applications. I have tested this with a set of applications; sometimes the mix contains multiple instances of the same application, sometimes different applications.
Following are the observations:

1) The mix mostly runs to completion when all the applications are started at about the same time.
2) When there is a sleep between the starts of the various applications, I get some "unspecified launch failure" errors.

So, the question is: does the CUDA runtime guarantee support for multiple applications sharing a single GPU at the same time?

I am using CUDA 3.0.

If you have some information on this or have experienced this before, please share your thoughts with me. This will be of great help.


Thanks in advance!

#1
Posted 07/30/2010 02:55 PM   
Yes, CUDA does support multithreaded execution, but you have to be careful how it's used.

There can only be one CUDA context attached to each host thread, so calling CUDA runtime functions from multiple threads will implicitly create multiple CUDA contexts, as suggested by [i]3.4 Interoperability between Runtime and Driver APIs[/i] in the programming manual.

For me, the main issue this causes is that memory allocated in one context can't be used in another, but there is a moderately complex way to overcome this.
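To illustrate, here is a minimal hedged sketch (hypothetical names, error checking elided, assuming the CUDA 3.x runtime) of what implicit per-thread contexts mean in practice: the first runtime call in each pthread creates a separate context, so a device pointer is only valid in the thread that allocated it.

```c
#include <pthread.h>
#include <cuda_runtime.h>

/* Sketch: each pthread that calls into the runtime API gets its own
 * implicit CUDA context on first use, so d_buf below is only meaningful
 * inside the thread (and context) that allocated it. */
static void *thread_fn(void *arg)
{
    float *d_buf;
    cudaSetDevice(0);   /* first runtime call: an implicit context
                           is created for THIS thread */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    /* ... kernel launches and copies here are bound to this thread's context ... */
    cudaFree(d_buf);    /* must be freed from the same context */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, thread_fn, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```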

#2
Posted 07/30/2010 06:02 PM   
Hi, thank you for your answer.

I am aware of the fact that the CUDA runtime creates a context for each host thread. In fact, my application consists of multiple threads and therefore spawns multiple CUDA contexts. Moreover, the threads are completely independent and do not share any variables.
The problems I experience are not in accessing memory: when the threads are not started simultaneously, I mostly get CUDA launch errors.

It seems to me that the behavior of the CUDA runtime is not deterministic: the execution sometimes runs successfully to completion and sometimes does not.

#3
Posted 07/30/2010 09:12 PM   
What's the error? cudaGetErrorString(cudaGetLastError())

#4
Posted 07/30/2010 09:15 PM   
unspecified launch failure

[quote name='Uncle Joe' post='1096451' date='Jul 30 2010, 05:15 PM']What's the error? cudaGetErrorString(cudaGetLastError())[/quote]

#5
Posted 07/30/2010 09:52 PM   
For completeness, this is the function that I run in each pthread. Some of them run to completion, and some report the unspecified launch failure. As I mentioned, the execution outcome is not deterministic.

[code]
void gpu_func(int ndim, int num_rows, int num_cols, float *in_mat, float *out_mat,
              int num_gpu_blocks, int num_gpu_threads, int tid, FILE *log)
{
    cudaError_t rc;
    float *d_in_mat, *d_out_mat;

    clk_init_start = rtclock();

    rc = cudaSetDevice(0);
    if (rc != cudaSuccess) {
        printf("%d, cudaSetDevice() :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }

    clk_init_end = rtclock();
    clk_init_time = clk_init_end - clk_init_start;
    printf("***Time for device initialization:%lf***\n", clk_init_end - clk_init_start);

    rc = cudaMalloc((void **)&d_in_mat, sizeof(float) * num_rows * num_cols);
    if (rc != cudaSuccess) {
        printf("%d, cudaMalloc(d_in_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }

    rc = cudaMalloc((void **)&d_out_mat, sizeof(float) * num_rows * num_cols);
    if (rc != cudaSuccess) {
        printf("%d, cudaMalloc(d_out_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }

    printf("%d:: d_in_mat=%p\n", tid, d_in_mat);
    printf("%d:: d_out_mat=%p\n", tid, d_out_mat);

    clk_copy_start = rtclock();
    rc = cudaMemcpy(d_in_mat, in_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyHostToDevice);
    if (rc != cudaSuccess) {
        printf("%d, xfer1 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    fprintf(log, "Data Xfer1:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    fflush(log);

    clk_copy_start = rtclock();
    rc = cudaMemcpy(d_out_mat, out_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyHostToDevice);
    if (rc != cudaSuccess) {
        printf("%d, xfer2 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    fprintf(log, "Data Xfer2:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    fflush(log);

    printf("No. of blocks:%d, No. of threads:%d\n", num_gpu_blocks, num_gpu_threads);

    dim3 grid(num_gpu_blocks, 1, 1);
    dim3 thread(num_gpu_threads, 1, 1);

    clk_comp_start = rtclock();
    Jacobi_device<<<grid, thread, 1024>>>(ndim, num_rows, num_cols, d_in_mat, d_out_mat,
                                          num_gpu_blocks, num_gpu_threads);
    cudaThreadSynchronize();
    rc = cudaGetLastError();
    if (rc != cudaSuccess) {
        printf("%d, kernel :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_comp_end = rtclock();
    clk_comp_time = clk_comp_end - clk_comp_start;
    fprintf(log, "Computation:%d\t%lf\t%lf\t%lf\n", tid, clk_comp_start, clk_comp_end, clk_comp_time);
    fflush(log);

    clk_copy_start = rtclock();
    rc = cudaMemcpy(out_mat, d_out_mat, sizeof(float) * num_rows * num_cols, cudaMemcpyDeviceToHost);
    if (rc != cudaSuccess) {
        printf("%d, xfer3 :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    clk_copy_end = rtclock();
    total_copy_time += clk_copy_time = clk_copy_end - clk_copy_start;
    fprintf(log, "Data Xfer3:%d\t%lf\t%lf\t%lf\n", tid, clk_copy_start, clk_copy_end, clk_copy_time);
    fflush(log);

    rc = cudaFree(d_in_mat);
    if (rc != cudaSuccess) {
        printf("%d, cudaFree(d_in_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
    rc = cudaFree(d_out_mat);
    if (rc != cudaSuccess) {
        printf("%d, cudaFree(d_out_mat) :: %s\n", tid, cudaGetErrorString(rc));
        fflush(log);
        return;
    }
}
[/code]

#6
Posted 07/30/2010 09:55 PM   
Based on my experience, it sounds like you might have an out-of-bounds memory bug somewhere that only shows up later ("unspecified launch failure").

Have you tried commenting out pieces of your code in a divide-and-conquer way?

#7
Posted 07/30/2010 10:03 PM   
[quote name='Uncle Joe' date='30 July 2010 - 08:02 PM' timestamp='1280512956' post='1096349']
Yes, CUDA does support multithread execution, but you have to be careful how it's used.

There can only be 1 CUDA context attached to each host thread, so calling CUDA runtime functions from multiple threads
will implicitly create multiple CUDA contexts, as is suggested by [i]3.4 Interoperability between Runtime and Driver APIs[/i]
in the programming manual.

For me, the main issue this causes is memory allocated in 1 context can't be used in another, but there's a moderately complex way to overcome this.
[/quote]

Hi,

I am trying to share buffers, allocated by only one thread, between two threads. Because of the two different contexts, the second thread cannot access the device memory. What is the "moderately complex" trick you use to overcome this?
Thanks in advance,

Sam

#8
Posted 01/21/2011 11:12 AM   
Fundamentally, device memory allocations are context-local and contexts are thread-local, so there is no direct way of sharing device pointers between different threads.

You can use portable pinned memory to create a common buffer for all threads to share, and use it as a staging point for device-to-device transfers between contexts.
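A hedged sketch of this staging approach (assuming the CUDA 3.x runtime; function names other than the CUDA calls are hypothetical and error checking is omitted): allocating the host buffer with the `cudaHostAllocPortable` flag makes it pinned with respect to all contexts, so any thread can copy through it.

```c
#include <cuda_runtime.h>

/* Sketch: a host buffer allocated with cudaHostAllocPortable is pinned
 * for ALL CUDA contexts, so any thread/context can use it as the
 * staging point for transfers between contexts. */
static float *staging;

void alloc_shared_staging(size_t n)
{
    cudaHostAlloc((void **)&staging, n * sizeof(float), cudaHostAllocPortable);
}

/* Thread A (context A): copy its device buffer out to the shared staging area. */
void stage_out(const float *d_src, size_t n)
{
    cudaMemcpy(staging, d_src, n * sizeof(float), cudaMemcpyDeviceToHost);
}

/* Thread B (context B): copy from staging into its own device buffer. */
void stage_in(float *d_dst, size_t n)
{
    cudaMemcpy(d_dst, staging, n * sizeof(float), cudaMemcpyHostToDevice);
}
```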

#9
Posted 01/21/2011 11:23 AM   
@TRT_Sam: A very useful class seems to be 'GPUWorker', where one CPU worker-thread object manages the single CUDA context associated with one GPU.
See http://forums.nvidia.com/index.php?showtopic=66598
or http://lissom.googlecode.com/svn/trunk/MPICUDALissomV1/src/GPUWorker.cc

#10
Posted 01/21/2011 01:12 PM   
GPUWorker can be handy, but it also adds a lot of complications to your code and build process.

Tim has hinted that this kind of management may be getting easier soon. Presumably this means that a future version of CUDA will no longer tie threads to contexts so tightly.

#11
Posted 01/21/2011 04:57 PM   
Sam, the current method to use the same CUDA context from multiple host threads is to use cuCtxPushCurrent() and cuCtxPopCurrent() to bind the context to the host thread each time it wants to use it. In my experience this doesn't really cost any performance; it only adds code complexity and the danger of forgetting to acquire/release the context.

You would use it like this (pseudocode):

[code]
main()
{
    cuCtxCreate()   // do this before all CUDA calls so that the CUDA runtime
                    // layer will use this context instead of creating an implicit one
    global_context = cuCtxPopCurrent()

    cuCtxPushCurrent(global_context)
    cudaCode()
    cuCtxPopCurrent()
}
[/code]
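Fleshing that pseudocode out with the actual CUDA 3.x driver-API signatures gives roughly the following sketch (function names other than the cu* calls are hypothetical; every cu* call returns a CUresult that real code should check):

```c
#include <cuda.h>

static CUcontext global_context;

/* Call once, from any thread, before any runtime-API use. */
void init_context(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* Creates a context and makes it current on the calling thread. */
    cuCtxCreate(&global_context, 0, dev);
    /* Detach it so other threads can attach it later. */
    cuCtxPopCurrent(&global_context);
}

/* Any thread wraps its CUDA work in a push/pop pair. */
void use_context_from_any_thread(void)
{
    cuCtxPushCurrent(global_context);   /* bind context to this thread */
    /* ... runtime-API calls (cudaMalloc, kernel launches, ...) now use
     *     global_context instead of creating an implicit context ... */
    cuCtxPopCurrent(&global_context);   /* release it for other threads */
}
```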

#12
Posted 01/21/2011 06:47 PM   
The next release of CUDA offers a new hope for multithreaded and multi-GPU developers. :)

#13
Posted 01/21/2011 09:09 PM   
[quote name='tmurray' date='21 January 2011 - 02:09 PM' timestamp='1295644146' post='1181345']
The next release of CUDA offers a new hope for multithreaded and multi-GPU developers. :)
[/quote]

CUDA Episode IV: A New Hope

#14
Posted 01/24/2011 08:49 PM   