I’m (finally) gearing up to write some multi-GPU code. As you might know, the CUDA model is one thread = one GPU context, which suits a peer model of threading. That is fine in many cases, but my application structure is very modular and pretty much incompatible with a peer threading model. Sure, I could restructure it, but doing so would introduce a lot of complications I would prefer to avoid. Thus I needed a master/slave thread approach, where each worker thread holds a CUDA context and the master thread can send messages to many slave threads. GPUWorker was born. Since this may be useful to someone else (and the code is open source), I thought I’d share it with everyone.
Advantages:
- A single master thread can call CUDA runtime and kernel functions on multiple GPUs
- Any CUDA runtime function (actually, any function returning cudaError_t) can be called in the worker thread easily with a simple syntax
- No performance difference from straight CUDA calls in realistic situations (see the performance tests below)
- Works in Windows and Linux
Disadvantages:
- A slight extra latency is added to synchronous calls (due to OS thread scheduling)
Example:
GPUWorker gpu0(0);
GPUWorker gpu1(1);
// allocate data on each GPU (call() is synchronous: it blocks until done)
int *d_data0;
gpu0.call(bind(cudaMalloc, (void**)&d_data0, sizeof(int)*N));
int *d_data1;
gpu1.call(bind(cudaMalloc, (void**)&d_data1, sizeof(int)*N));
// launch kernels on both GPUs (callAsync() queues the call and returns)
gpu0.callAsync(bind(kernel_caller, d_data0, N));
gpu1.callAsync(bind(kernel_caller, d_data1, N));
Get the code
http://trac2.assembla.com/hoomd/browser/br…orker.h?rev=994
http://trac2.assembla.com/hoomd/browser/br…rker.cc?rev=994
Using the code is easy: just compile GPUWorker.cc into your project. Note that you will probably want to remove the #ifdef USE_CUDA macro guard; it is used in HOOMD for CPU-only builds. You also need to have boost (www.boost.org) installed and to link against the boost thread library.
The code is part of HOOMD, which is released under an open source license: see the file for the details. The code also contains extensive documentation in doxygen-style code comments.
Performance tests
All the mutex locks, context switches, etc. do add up to a bit of extra overhead for each call. This is most apparent when making synchronous calls. The simplest test I can think of to measure the overhead is to repeatedly copy 4 bytes from the device to the host. Here are the results (tested on 64-bit Linux on a single GPU of the Tesla D870):
GPUWorker latency test
Time per call 34.431 us
Standard latency test
Time per call 24.381 us
As you can see, the increased latency is significant. GPUWorker is not for you if your application depends on the best possible latency in such operations.
However, in more realistic situations (at least for my application), making thousands of ~10 ms kernel calls in a row poses no performance penalty. Again, this test is on a single GPU of the Tesla D870.
Standard realistic test
Time per step 11082.2 us
GPUWorker realistic test
Time per step 11080.6 us
In multiple runs, the delta on the time measurements is +/- 5 us, so the difference is in the noise.
Running the same realistic test on two peer-type worker threads without GPUWorker gives the following timings (this test uses both GPUs in the D870):
Peer-based mgpu test (GPU 0)
Time per step 11081.8 us
Peer-based mgpu test (GPU 1)
Time per step 11079.6 us
And running the realistic test on both GPUs using GPUWorker gives the following result:
Master/slave-based mgpu test
Time per step 11083 us
The conclusion is simple: In realistic situations with many contiguous asynchronous calls, there is no apparent performance penalty. If you want to see the full code of the benchmarks, look here: