GPUWorker master/slave multi-gpu approach
I'm (finally) gearing up to write some multi-GPU code. As you might know, the CUDA model is one thread = one GPU context, which suits a peer model of threading. That is fine in many cases, but my application structure is very modular and pretty much incompatible with a peer threading model. Sure, I could change things, but doing so would introduce a lot of complications I would prefer to avoid. So I needed a master/slave thread approach, where each worker thread holds a CUDA context and the master thread can send messages to many slave threads. GPUWorker was born. Since this may be useful to someone else (and the code is open source), I thought I'd share it with everyone.

Advantages:
+ A single master thread can call CUDA runtime and kernel functions on multiple GPUs
+ [b]ANY[/b] CUDA runtime function (actually, any function returning cudaError_t) can be called in the worker thread easily with a simple syntax
+ No performance difference from straight CUDA calls (in realistic situations, see performance tests below)
+ Works in Windows and Linux
Disadvantages:
- A slight extra latency is added to synchronous calls (due to OS thread scheduling)

Example:
[code]
GPUWorker gpu0(0);
GPUWorker gpu1(1);

// allocate data
int *d_data0;
gpu0.call(bind(cudaMalloc, (void**)((void*)&d_data0), sizeof(int)*N));
int *d_data1;
gpu1.call(bind(cudaMalloc, (void**)((void*)&d_data1), sizeof(int)*N));

// call kernel
gpu0.callAsync(bind(kernel_caller, d_data0, N));
gpu1.callAsync(bind(kernel_caller, d_data1, N));
[/code]
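To round out the example, here is a sketch of how I'd expect a typical session to finish, using the same call/callAsync pattern (the cudaMemcpy/cudaFree usage is just my guess at the natural continuation, not lifted from GPUWorker's docs; d_data0, d_data1, and N are from the example above):
[code]
// copy results back with the synchronous call(), which blocks until
// the worker thread has executed the bound function
int h_data0[N];
gpu0.call(bind(cudaMemcpy, h_data0, d_data0, sizeof(int)*N,
               cudaMemcpyDeviceToHost));

// free device memory on each GPU
gpu0.call(bind(cudaFree, d_data0));
gpu1.call(bind(cudaFree, d_data1));
[/code]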

[b]Get the code[/b]
[url="http://trac2.assembla.com/hoomd/browser/branches/gpu-iface-rewrite/src/utils/GPUWorker.h?rev=994"]http://trac2.assembla.com/hoomd/browser/br...orker.h?rev=994[/url]
[url="http://trac2.assembla.com/hoomd/browser/branches/gpu-iface-rewrite/src/utils/GPUWorker.cc?rev=994"]http://trac2.assembla.com/hoomd/browser/br...rker.cc?rev=994[/url]
Using the code is easy: just compile GPUWorker.cc into your project. Note that you probably want to remove the #ifdef USE_CUDA macro guard; it is used in HOOMD for CPU-only builds. You also need Boost (www.boost.org) installed, and you must link against the Boost thread library.

The code is part of HOOMD, which is released under an open source license: see the file for the details. The code also contains extensive documentation in doxygen-style code comments.

[b]Performance tests[/b]

All the mutex locks, context switches, etc. do add up to a small amount of extra overhead for each call. This is most apparent when making synchronous calls. The simplest test I can think of to measure the overhead is to repeatedly copy 4 bytes from the device to the host. Here are the results (tested on 64-bit Linux on a single GPU of the Tesla D870):
[code]
GPUWorker latency test
Time per call 34.431 us

Standard latency test
Time per call 24.381 us
[/code]
As you can see, the increased latency is significant. GPUWorker is not for you if your application depends on the best possible latency in such operations.

However, in more realistic situations (at least for my application), making thousands of ~10 ms kernel calls in a row poses no performance penalty. Again, this test is on a single GPU of the Tesla D870.
[code]
Standard realistic test
Time per step 11082.2 us

GPUWorker realistic test
Time per step 11080.6 us
[/code]
In multiple runs, the variation in the time measurements is +/- 5 us, so the difference is in the noise.

Running the same realistic test on two peer-type worker threads without GPUWorker gives the following timings (this test uses both GPUs in the D870):
[code]
Peer-based mgpu test (GPU 0)
Peer-based mgpu test (GPU 1)
Time per step (GPU 0) 11081.8 us
Time per step (GPU 1) 11079.6 us
[/code]

And running the realistic test on both GPUs using GPUWorker gives the following result:
[code]
Master/slave-based mgpu test
Time per step 11083 us
[/code]

The conclusion is simple: in realistic situations with many contiguous asynchronous calls, there is no apparent performance penalty. If you want to see the full code of the benchmarks, look here:
[url="http://trac2.assembla.com/hoomd/browser/branches/gpu-iface-rewrite/src/benchmarks/gpu_worker_bmark.cc?rev=994"]http://trac2.assembla.com/hoomd/browser/br...mark.cc?rev=994[/url]

#1
Posted 05/07/2008 07:09 PM   
Sir, your kindness and ingenuity have put goosebumps on my wretched body. Thank you! This is very valuable, and is probably so for a great deal of people out there.

#2
Posted 05/07/2008 07:59 PM   
I would like to express the same feelings :)

Very nice indeed. One of my projects might benefit from this enormously. Hmmm, thinking a bit more about this, now I want to stuff as many CUDA cards into my machine as possible, so this is going to be an expensive library for my boss...

#3
Posted 05/07/2008 08:26 PM   
This functionality looks pretty useful. I wonder if it would be worthwhile to integrate this into the CuPP project. Given that it and HOOMD both use BSD-style licenses, it's probably not an issue of logistics.

#4
Posted 05/09/2008 03:09 AM   
Thanks MisterAnderson for this excellent tool! I wrote a GRAPE6-like library and wanted to extend it for multi-GPU support. The standard way, as presented in the SDK, is too awkward to implement; I probably have a similar problem as in HOOMD.

GPUWorker solved my problem! I just lose 5 GFLOP/s per GPU; instead of 250, I am getting 245 GFLOP/s, but it means from two I'll get nearly 490!
=== UPDATE ===
I do not lose 5 GFLOP/s per GPU. The C++ part of the code was compiled with -O0 flags and compared against -O3 compiled code. After comparing apples with apples, i.e. -O3 with -O3, there is nearly no loss of performance!
=== UPDATE ===

Great job, much appreciated! Are there any papers where it is implemented, so that I could cite it in my paper, which will be published soon?

Cheers,
Evghenii

#5
Posted 06/14/2008 09:16 PM   
Cool, I'm glad to hear it's working out for you. I've got the single-gpu HOOMD switched over to GPUWorker as a first step and as you noticed in your code, there are no performance penalties when compiling with optimizations enabled.

Here is the reference for the HOOMD paper.
[url="http://dx.doi.org/10.1016/j.jcp.2008.01.047"]http://dx.doi.org/10.1016/j.jcp.2008.01.047[/url]
Journal of Computational Physics 227 (2008) 5342-5359

#6
Posted 06/15/2008 01:23 PM   
The GPUWorker class is really useful, but I have compatibility issues with Boost libraries.
Here on my system I can handle them just fine, but the final program will be run on a remote Tesla: my system is 32 bit, the remote one 64 bit, hence I cannot just move around the executables, I have to recompile the whole software.

Do you have a version w/o Boost (i.e., using plain pthreads)?
I've looked at the code; it's not long, so I could convert it manually. The main problem would be implementing a replacement for `boost::bind`.
Could you help me, please?

#7
Posted 06/28/2008 02:11 PM   
[quote]my system is 32 bit, the remote one 64 bit, hence I cannot just move around the executables
[/quote]
Does your distribution have a 32-bit compatibility library for boost? I think that most redhat type distributions do, although I could be wrong.

For HOOMD, I just statically link the boost libraries for the distribution executable.

[quote name='spg' date='Jun 28 2008, 08:11 AM']Do you have a version w/o Boost (ie, using plain pthread) ?
I've looked at the code; it's not long, I could convert it manually. The main problem would be to implement a replacement to `Boost::bind`.
May you help me, please?
[right][snapback]401691[/snapback][/right]
[/quote]
Before writing this, I did find a few alternatives to boost::bind that were almost as general. IIRC, the best phrase to search for was "function delegate". Sorry, I don't recall any specifics about which libraries I thought promising.

#8
Posted 06/28/2008 02:58 PM   
[quote name='spg' date='Jun 28 2008, 04:11 PM']The GPUWorker class is really useful, but I have compatibility issues with Boost libraries.
Here on my system I can handle them just fine, but the final program will be run on a remote Tesla: my system is 32 bit, the remote one 64 bit, hence I cannot just move around the executables, I have to recompile the whole software.

Do you have a version w/o Boost (ie, using plain pthread) ?
I've looked at the code; it's not long, I could convert it manually. The main problem would be to implement a replacement to `Boost::bind`.
May you help me, please?
[right][snapback]401691[/snapback][/right]
[/quote]

This is quite strange. I have no problem compiling my code, which uses GPUWorker, on both 32-bit and 64-bit systems. What kind of problems are you running into while compiling your code on the 64-bit system?

#9
Posted 06/28/2008 03:16 PM   
Sorry, the problem is not with compiling.
I meant that I can compile it under my 32-bit system, but then I cannot run the executable on the 64-bit system because there is no 32-bit CUDA installed (hence execution fails for lack of the needed dynamic libraries).

#10
Posted 06/28/2008 03:20 PM   
[quote name='spg' date='Jun 28 2008, 05:20 PM']Sorry, the problem is not on compiling.
I meant that I can compile it under my 32 bit system, but then I cannot run the executable on the 64bit system because there is no 32 bit CUDA installed (hence execution fails by lack of dynamic libraries needed).
[right][snapback]401714[/snapback][/right]
[/quote]

Sorry, but I am not fully following you. I have both 32-bit and 64-bit systems available. On the 32-bit one I installed both 32-bit CUDA and libboost_thread, and on the 64-bit one I installed both 64-bit CUDA and libboost_thread. In this case, my code compiles & runs on both 64-bit and 32-bit.

What is your setup?

#11
Posted 06/28/2008 03:24 PM   
Yes, but I don't have root privileges on the remote machine, and it only has 64-bit CUDA.
However I've just noticed it has boost installed, so I would be able to compile on that, too.

In my opinion, though, it would be more useful to have a plain implementation of GPUWorker (I don't usually code with Boost, and it takes me a lot of time to download/compile/install it).
Actually, a plain class with just pthreads and a plain implementation of boost::bind would be more lightweight and easier to port to other systems.

#12
Posted 06/28/2008 03:41 PM   
[quote]What is your setup?
[/quote]
My setups include 64-bit linux, 32-bit linux, windows XP 32, Mac OS X and Vista 64-bit. I statically link boost so that end users who download my executable don't have to go through all the headaches of installing boost.

[quote name='spg' date='Jun 28 2008, 09:41 AM']Yes, but I don't have root privileges on the remote machine, and it only have 64 bit CUDA.
However I've just noticed it has boost installed, so I would be able to compile on that, too.
[/quote]
That is probably the best solution.

[quote]In my opinion, thought, it would be more useful to have a plain implementation of GPUWorker (I don't usually code with Boost, and it take much time to me to download/compile/install it).
Actually a plain class, with just pthread and a plain implementation of Boost::bind would be more lightweight and easier to port to other systems.
[right][snapback]401726[/snapback][/right]
[/quote]
I agree, the Boost requirement is the one drawback to GPUWorker. Installing Boost can be difficult, even on systems that provide a package in the repository (e.g. Ubuntu/Debian require installing about a dozen different Boost packages to get everything working, although Gentoo just needs "emerge boost" :) ).

But I don't agree that using pthreads will make it more portable. I need it to run on Windows too! So any solution that is to be as portable as the original must use both a cross-platform threading library and a function delegate library. Boost has both (and I was already using Boost), so I went with that. And Boost is very portable across many platforms.

Feel free to re-implement the code with whatever libraries you prefer to use. I don't have the time to do so myself.

#13
Posted 06/28/2008 04:39 PM   
Yes, statically linking is a great solution; I didn't think of that.

Actually, I didn't mean pthreads are more portable (porting pthread programs can differ even among UNIXes); I meant that the code would be easily modifiable to use another threading library.

However, thank you for your answer.

#14
Posted 06/28/2008 04:57 PM   
Excellent work!
Thank you very much for your contribution!

#15
Posted 06/28/2008 07:18 PM   