CUDA 4.0 Context Sharing by Threads Impact on existing Multi-threaded Apps

As I understand it, in earlier CUDA releases, if you declare a GPU array like the one below, each CPU thread working on the SAME DEVICE inherits a separate context and hence a separate physical copy of the array.

__device__ int temp[CONFIGURED_MAX];

With CUDA 4.0, this “temp” array is still allocated per context, but because all threads working on the SAME DEVICE now share a single context, they will all see the same physical copy of the array.
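For concreteness, here is a minimal single-threaded sketch of how such a module-scope array gets written and read back (the fill kernel, launch configuration, and CONFIGURED_MAX value are illustrative, not from the original post); which physical copy is touched depends on the context that is current on the calling host thread:

#include <cstdio>
#include <cuda_runtime.h>

#define CONFIGURED_MAX 256

// Module-scope device array: one physical copy per CUDA context.
__device__ int temp[CONFIGURED_MAX];

__global__ void fill(int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < CONFIGURED_MAX)
        temp[i] = value;
}

int main()
{
    int host[CONFIGURED_MAX];

    fill<<<(CONFIGURED_MAX + 127) / 128, 128>>>(42);
    cudaDeviceSynchronize();

    // Reads the copy of 'temp' owned by the context current on this thread.
    cudaMemcpyFromSymbol(host, temp, sizeof(host), 0, cudaMemcpyDeviceToHost);
    printf("temp[0] = %d\n", host[0]);
    return 0;
}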

So,

What happens to existing multi-threaded applications that rely on the availability of a separate context for each CPU thread (on the same GPU device)?

Will they not suffer from this new CUDA behaviour?

Am I missing something? Is there a separate API call to enable this new behavior?

As I understand it, nothing changes. There is still always a separate context per GPU. The API has been made thread safe so that a single thread can operate seamlessly with multiple contexts without the migration overhead there used to be, but otherwise everything is as it was. Separately, Fermi Tesla GPUs can enable a peer-to-peer global memory sharing mechanism, but I don’t think that impacts what you are asking about either.
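To illustrate that “single thread, multiple contexts” point, here is a minimal driver-API sketch (device ordinal 0 assumed, error checking omitted) of one host thread creating two contexts on the same device and switching between them explicitly:

#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctxA, ctxB;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    cuCtxCreate(&ctxA, 0, dev);   // ctxA becomes current on this thread
    cuCtxPopCurrent(NULL);        // detach it again
    cuCtxCreate(&ctxB, 0, dev);   // ctxB becomes current
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctxA);       // do some work against ctxA ...
    cuCtxPopCurrent(NULL);
    cuCtxPushCurrent(ctxB);       // ... then against ctxB, from the same host thread
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctxA);
    cuCtxDestroy(ctxB);
    return 0;
}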

Avid,

I am not talking about separate contexts for different devices…

Say I had a single-GPU system. Process P has two threads, T1 and T2, running.
Both T1 and T2 perform CUDA operations.

Prior to 4.0, T1 and T2 would get two different contexts, and hence two different physical copies of the “temp” array (see the first post).
However,
with 4.0, T1 and T2 get the same context and hence share the same physical copy – this can result in races (data corruption) between those threads.
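To make that concrete, here is a hedged sketch (pthreads, error checking omitted; the fill kernel and the values written are illustrative) of the T1/T2 pattern: each thread assumes “temp” is private to it, which holds pre-4.0 but not under a shared context:

#include <pthread.h>
#include <cstdio>
#include <cuda_runtime.h>

#define CONFIGURED_MAX 256

__device__ int temp[CONFIGURED_MAX];

__global__ void fill(int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < CONFIGURED_MAX)
        temp[i] = value;
}

static void *worker(void *arg)
{
    int value = *(int *)arg;
    int host[CONFIGURED_MAX];

    // Each thread assumes 'temp' is private to it (the pre-4.0 behavior).
    fill<<<(CONFIGURED_MAX + 127) / 128, 128>>>(value);
    cudaDeviceSynchronize();
    cudaMemcpyFromSymbol(host, temp, sizeof(host), 0, cudaMemcpyDeviceToHost);

    // With a single shared context in 4.0, the other thread's fill() may have
    // overwritten 'temp' before this readback, so host[0] need not equal value.
    printf("thread wrote %d, read back %d\n", value, host[0]);
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    int v1 = 1, v2 = 2;
    pthread_create(&t1, NULL, worker, &v1);
    pthread_create(&t2, NULL, worker, &v2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}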

I wonder how CUDA 4.0 handles this situation.

This is the one case that’s intentionally broken in 4.0. Using multiple contexts per device per process with the runtime API is generally a mistake on the developer’s part because he doesn’t understand the relationship between contexts and threads. Maintaining that model would basically ruin the ease of use enhancements for multi-GPU programming in 4.0. The only time when this behavior was expected/useful was in conjunction with exclusive mode, which is why we added exclusive-thread mode in 4.0.

If we see that there’s serious fallout from this, we may add something in a later RC. But I don’t think we will; I can’t think of a single application that intentionally used multiple contexts on the same GPU.

Thanks for clarifying… It’s always good to know the potholes so we can steer around them…
However,
This situation can easily occur for multi-GPU applications that are run on single-GPU systems, can it not?
Especially multi-GPU apps crafted out of existing multi-threaded applications…
That’s the only case I can think of.
For other cases, it is reasonable to expect the developer to spawn only as many threads as there are GPUs.

Although it probably doesn’t matter to many, we do intentionally use multiple contexts on the same GPU. We’ve found beneficial ways to exploit the old (pre-CUDA 4.0) runtime behavior in multithreaded applications multitasking on a single Fermi-class GPU. The host thread ↔ CUDA context affinity, and the isolation between the multiple threads/CUDA contexts associated with a single GPU, is (was) useful, not an error or an obstacle.

With the new default behavior in CUDA 4.0 (a single context per GPU shared across multiple host threads), we’ll be (rapidly) exploring workarounds. It’s understood that the CUDA 4.0 changes benefit many applications and developers, but ideally there would be a simple way to retain the old behavior via an explicit, lightweight call to an existing API such as cudaSetDeviceFlags(), using a new property/constant named something like “cudaDeviceIsolatedContext”, with the alternate (default) state being “cudaDeviceSharedContext”.
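Purely to illustrate that proposal (cudaDeviceIsolatedContext does not exist in CUDA 4.0; it is the hypothetical flag suggested above), the call site would look something like:

// Hypothetical: 'cudaDeviceIsolatedContext' is the flag proposed in this post,
// not a real CUDA constant; cudaSetDeviceFlags() itself is an existing runtime call.
cudaSetDeviceFlags(cudaDeviceIsolatedContext);
cudaSetDevice(0);
// ... the rest of the application stays written against the runtime API ...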

Otherwise, my understanding is that the old behavior can be recovered in CUDA 4.0 using some additional calls into the driver API, but we haven’t verified this yet.

Using the driver API is the expected workaround. The runtime will use whatever context is resident on the thread’s stack; if you do context management with the driver API and everything else with the runtime, you can have multiple contexts per GPU in 4.0.

There’s a long rationale for why we made this change. I can go into it if people are interested (there are long justifications for every part of 4.0), but it basically boils down to this: one context per device per process in the runtime lets multi-GPU and multithreaded code be dramatically simplified.

The following is the recommended and supported mechanism for achieving this. It does involve making three calls to the driver API in place of two calls to the runtime API (and the rest of your application can/should remain written in the runtime API).

If you had a thread that did (as a short example):

cudaSetDevice(i);
cudaMalloc(…);
K<<<…>>>();
cudaThreadExit();

Then, to get back the old behavior of CUDA 3.2, do:

cuDeviceGet(&device, i);      // changed: look up the device and explicitly create a context for just this thread using the driver API
cuCtxCreate(&ctx, 0, device); //          instead of calling cudaSetDevice(i);
cudaMalloc(…);                // old: just use the runtime API -- it will implicitly use the context you've bound to the thread
K<<<…>>>();                   //      (no need to use any other driver API calls or convert any part of your program to the driver API)
cuCtxDestroy(ctx);            // changed: destroy the thread's context using the driver API instead of calling cudaThreadExit();

The explicitly created context will be entirely independent of all other contexts on all other threads (just like the per-thread context runtime behavior in CUDA 3.2).
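Pulled together into a complete (if minimal) multithreaded sketch, assuming pthreads, device 0, and one context per worker thread (error checking omitted; the kernel and sizes are illustrative), the pattern looks like this:

#include <pthread.h>
#include <cuda.h>            // driver API: used only for context creation/destruction
#include <cuda_runtime.h>    // runtime API: used for everything else

__global__ void K(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = i;
}

static void *worker(void *)
{
    // Explicitly create a context private to this thread (CUDA 3.2-style behavior).
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // From here on, plain runtime API -- it uses the context bound to this thread.
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    K<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);

    // Tear the context down instead of calling cudaThreadExit().
    cuCtxDestroy(ctx);
    return NULL;
}

int main()
{
    cuInit(0);               // initialize the driver API once per process

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Each worker thread ends up with its own context and therefore its own copy of any __device__ symbols, just as in CUDA 3.2.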

All of that said, using multiple contexts on a single device comes at a substantial performance penalty and a less-substantial-but-still-nontrivial memory penalty and is strongly discouraged.

Thanks very much, Chris, for the detailed information and the clear snippet. We’ll explore that approach and our other WARs as well.

FYI we don’t see a performance hit of any significance from context switching in our application. The real performance benefits we achieve along with other benefits far outweigh the theoretical cost of the switch. But we’ll keep a close eye on it and remain flexible in case there is a sudden massive increase in context switching cost from future drivers or runtimes.