HyperQ and MPI
Hi everyone,

I’m trying to build a small CUDA sample that shows the HyperQ improvement when several MPI processes access the same GPU. My case is really basic: each MPI process launches a single kernel on my Tesla K20. The kernel does not use all of the GPU's resources (occupancy is around 6%), so in theory some of the kernels should execute concurrently. It seems like it should be easy, but after many attempts I still cannot get the expected behavior: the kernels are always executed serially…

My questions:
- Maybe (or surely :)) I'm forgetting something in my implementation… Is there a special trick to activate HyperQ on the GK110 architecture?
- Does anyone have a simple sample that shows how to use the HyperQ feature with MPI?

My configuration:
- Ubuntu 12.04
- Tesla K20
- Latest CUDA driver & toolkit
- Open MPI 1.4.3

Thanks for your help!
Guix

#1
Posted 02/08/2013 09:45 AM   
Occupancy only indicates how many threads are running on the GPU compared to the theoretical maximum. If you have one thread per threadblock, but for example allocate all available shared memory on an SMX to that thread, you can run only 13 threads on the 13 SMXes of your Tesla K20.
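
The shared-memory example above can be checked with a little arithmetic. The following is a minimal host-side sketch, not an official occupancy calculator: the `SmxLimits` values are assumed GK110 limits (13 SMXes on a K20, 64 resident warps and 16 resident blocks per SMX, 48 KB as the largest shared-memory configuration), the names `SmxLimits`, `Residency` and `estimate` are made up for illustration, and register pressure is ignored entirely. Occupancy is expressed here as active warps over the maximum number of warps per SMX.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SMX limits for a GK110-based Tesla K20; verify against your device.
struct SmxLimits {
    int smx_count = 13;            // SMXes on the GPU
    int max_warps = 64;            // resident warps per SMX
    int max_blocks = 16;           // resident thread blocks per SMX
    int shared_bytes = 48 * 1024;  // shared memory available per SMX
    int warp_size = 32;
};

struct Residency {
    int blocks_per_smx;   // thread blocks that fit on one SMX
    int threads_on_gpu;   // threads resident on the whole GPU
    double occupancy;     // active warps / maximum warps per SMX
};

// Hypothetical helper: estimates residency from block size and shared memory only.
Residency estimate(const SmxLimits& hw, int threads_per_block, int shared_per_block) {
    assert(threads_per_block > 0 && shared_per_block >= 0);
    int warps_per_block = (threads_per_block + hw.warp_size - 1) / hw.warp_size;
    int by_warps = hw.max_warps / warps_per_block;
    // A block that uses no shared memory is not limited by it.
    int by_shared = shared_per_block > 0 ? hw.shared_bytes / shared_per_block
                                         : hw.max_blocks;
    int blocks = std::min({hw.max_blocks, by_warps, by_shared});
    return {blocks,
            blocks * threads_per_block * hw.smx_count,
            static_cast<double>(blocks * warps_per_block) / hw.max_warps};
}
```

Under these assumptions, `estimate(SmxLimits{}, 1, 48 * 1024)` gives one block per SMX and 13 resident threads on the whole GPU, which is the situation described above: the occupancy figure is tiny, yet there is no room left for a second kernel that needs shared memory.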

If this isn't your bottleneck, check the output of `nvidia-smi -q`, and make sure the "Compute Mode" of the K20 is set correctly. I guess that in your case the "0/DEFAULT" option is the best choice.

Parallel Architecture Research in Eindhoven - http://parse.ele.tue.nl

#2
Posted 02/08/2013 10:56 AM   
The SDK should have a test case for the HyperQ feature.

eyal

#3
Posted 02/08/2013 11:54 AM   
Note that HyperQ by default only works for kernels launched from different streams in the same process. There is a tool (called proxy) that enables multiple MPI ranks on a node to run kernels in parallel on the same GPU, but documentation is exceedingly sparse. The only place I've seen it mentioned is in the GTC talk "S0351 - Strong Scaling for Molecular Dynamics Applications". CUDA 5.0 comes with the executables "nvidia-cuda-proxy-control" and "nvidia-cuda-proxy-server", which don't even have --help options.

#4
Posted 02/08/2013 01:17 PM   
nvidia-cuda-proxy-something sounds really interesting. Even though it does not have a --help option, you can run `man nvidia-cuda-proxy-control`. This is what the description is:

CUDA proxy is a feature that allows multiple CUDA processes to share a single GPU context. A CUDA program runs in proxy mode if the proxy control daemon is running on the system. When a CUDA program starts, it tries to connect to the daemon, which will then create a proxy server process for the connecting client if one does not exist for the user (UID) who launched the client. Each user (UID) has its own proxy server process. The proxy server creates the shared GPU context, manages its clients, and issues work to the GPU on behalf of its clients. The proxy mode should be transparent to CUDA programs.

You can find the options and some more documentation via `man nvidia-cuda-proxy-control`.


#5
Posted 02/08/2013 01:46 PM   
OK, thank you for the answers.

@Gert-Jan: You are right about occupancy, I was simplifying to keep it short. My kernel uses only a few registers and does not allocate shared memory (the profile is in the attachment). The Compute Mode is set to "default", and that does seem to be the best choice in my case.

@eyalhier74: There is a HyperQ sample provided with CUDA 5.0, but it shows how to launch many kernels in different streams in the same process. In my case I have several MPI processes.

@DrAnderson42: What?? Access from multiple MPI processes does not work by default? Thank you for this information, I did not know. Where did you find it? It should be written in bold in the GK110 whitepaper... The documentation really is exceedingly sparse, which is a pity.

@Everyone: I will try with "nvidia-cuda-proxy-something". I'll let you know!

Thanks,
Guix

Image

#6
Posted 02/08/2013 03:48 PM   
Hi everyone,

I tried to use the "nvidia-cuda-proxy-control" and "nvidia-cuda-proxy-server" executables to run my CUDA/MPI application, without success...
I can run "nvidia-cuda-proxy-control" and launch the proxy control daemon, but then I am lost. If I try to launch my CUDA/MPI application, I get an error message: all cuda-capable devices are busy or unavailable.
Surely I must use "nvidia-cuda-proxy-server" too, but I do not know how it works or what it does because there is no documentation for it. There is only the man page for "nvidia-cuda-proxy-control", which is really short.

Has anyone ever used "nvidia-cuda-proxy-something", or does anyone have VIP documentation that could help me?

Thanks in advance,
Guix

#7
Posted 02/11/2013 02:34 PM   
@Gert-Jan: Aha! I checked for man pages, but was running on a system that didn't have them installed for some reason. Now I see them.

@Guix: I learned about it in that GTC talk I mentioned. You might want to watch the video recording of the talk to see for yourself. But don't expect too much on proxy: it is discussed only briefly and without details. I only know that it exists, and I have never tried to make it work. It seems like it's a very beta feature and not intended for mass consumption yet.

The man page mentions log files, did you check those? Maybe there is something there to help you.

#8
Posted 02/12/2013 07:10 PM   
Hi Guys,

Sorry for the confusion on this. I agree the documentation and online help for HyperQ-related features could be much better.

HyperQ refers to two related capabilities of the Tesla K20 and later GPUs:
  1. concurrency, when possible, for kernels launched into different streams in the same process
  2. concurrency, when possible, between kernels launched from different MPI ranks in different processes running in parallel on the same node.



In the CUDA 5.0 release:
  • #1 is supported and documented (http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#hyperq). There is also sample code in the simpleHyperQ example here: http://docs.nvidia.com/cuda/cuda-samples/index.html#advanced
  • #2 is supported on a few Cray-based systems (e.g. Titan) in the CUDA 5.0 release. We’re working on productizing (testing, documentation, etc.) this feature for a wider range of hardware/software configurations in an upcoming release.


I hope this is helpful. Drop me a PM if you are interested in trying this feature in a pre-release build and providing feedback based on your experience.

Thanks,
Ujval Kapasi
NVIDIA


#9
Posted 02/20/2013 01:03 AM   
It has been 6 months since this thread was last active. Hopefully things have changed a little.

I have machines with 16 cores and 4 Kepler cards each, running Red Hat Linux. I am trying to test my code on this setup before moving to Titan.

Is there a way I can run the CUDA proxy on all nodes and then have 4 CPU cores share one GPU card, so that the 4 CUDA launches run concurrently on every GPU card?

This way I can use all 16 cores and all 4 cards on each node.

My code is set up this way: each MPI process makes a single cuda call. I have already tested with a single CPU core and verified that one GPU card can handle the number of gpu threads issued by four MPI processes.

I have the MPI routines (from Oak Ridge) that identify the device IDs for each MPI process. Do I just launch the proxy and have each MPI process select its intended device?
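
For the 16-core, 4-card layout described above, the rank-to-device mapping could look like the following minimal sketch. This is not the Oak Ridge routine: `device_for_local_rank` is a hypothetical helper, and it assumes the node-local rank (0 to 15 here) is already known from the MPI launcher. In a real program the returned index would typically be passed to `cudaSetDevice` before any other CUDA call.

```cpp
#include <cassert>

// Hypothetical helper: assigns node-local MPI ranks to GPUs in contiguous groups,
// so that with 16 ranks and 4 GPUs, ranks 0-3 share device 0, ranks 4-7 share
// device 1, and so on.
int device_for_local_rank(int local_rank, int ranks_per_node, int gpus_per_node) {
    assert(ranks_per_node > 0 && gpus_per_node > 0);
    assert(local_rank >= 0 && local_rank < ranks_per_node);
    // Round up so an uneven rank count still maps into [0, gpus_per_node).
    int ranks_per_gpu = (ranks_per_node + gpus_per_node - 1) / gpus_per_node;
    return local_rank / ranks_per_gpu;
}
```

Grouping contiguous ranks is only one possible choice. A round-robin mapping (`local_rank % gpus_per_node`) spreads neighbouring ranks across cards instead, which may or may not suit the communication pattern of the code.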


My second question is: if I have the proxy running, can a simple serial code (not MPI) use the proxy and run on a specified card?

#10
Posted 06/25/2013 07:27 PM   
I wrote some detailed instructions to enable CUDA MPS (formerly known as CUDA proxy) on a machine with multiple GPUs.
It is an unsupported configuration, but it works. Details at http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html

#11
Posted 07/17/2013 03:18 PM   
Hello,

I hope it is OK to jump into this topic. I thought HyperQ meant that kernels from different programs or processes (MPI, for example, or OpenMP, or just two programs running on the same card) would run concurrently. Is this wrong?

According to this page, http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/, kernels from different MPI processes are executed concurrently.

#12
Posted 07/17/2013 08:29 PM   
Sorry to resurrect this topic, but I have one question about HyperQ (HQ) that I have not been able to find the answer to.
Suppose I have a program with no communication with the CPU that does not use 100% of the card. If I run 2 such programs at the same time, will it take less time than running them one after the other? On cards without HQ, the kernels from each program are just executed sequentially. Will 2 programs run concurrently on Titan?

#13
Posted 10/11/2013 12:00 PM   