Hyper-Q and OpenMP on single GTX-Titan GPU

vincentroberge · May 15, 2013, 2:01pm

I have three different and independent algorithms written in CUDA. Each algorithm is a series of kernels, but the kernels do not fully use the processing power of the latest GPUs. I would like to run the 3 algorithms in parallel using 3 OpenMP threads, each controlling one stream of the GPU. The three threads would fill 3 Hyper-Q queues in parallel, and the GPU would be able to schedule the different kernels in an optimal fashion to maximize its use. The code below illustrates my problem. In this example, kernel_a, kernel_b, etc. are just dummy kernels that use processing power. This code executes well on my GTX680, but I am looking at purchasing a GTX Titan to benefit from Hyper-Q.

My questions are:

Do you see any problem with my idea?
Can different OpenMP threads access a single GPU? (My code executes fine, but could I run into problems?)
Do I need a GTX Titan or a Tesla K20 for Hyper-Q with OpenMP (no MPI here)?
Do I need to protect every kernel call with a “#pragma omp critical” statement?

#include <omp.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#define N 100000

__global__ void kernel_a()
{
	float sum = 0.0;
	for(int i=0; i<N; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_b()
{
	float sum = 0.0;
	for(int i=0; i<N/2; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_c()
{
	float sum = 0.0;
	for(int i=0; i<N/4; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_d()
{
	float sum = 0.0;
	for(int i=0; i<N/8; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

int main()
{

	int n_streams = 32;
	int n_threads = 4;
	cudaError_t cudaStatus;
	
	// Allocate and initialize an array of stream handles
	cudaStream_t *streams = (cudaStream_t *) malloc(n_streams * sizeof(cudaStream_t));
	for (int i = 0 ; i < n_streams ; i++)
	{
		cudaStreamCreate(&(streams[i]));
	}

	// Parallel section
	#pragma omp parallel num_threads(n_threads)
	{
		for (int i=omp_get_thread_num(); i<n_streams; i=i+n_threads)
		{

			kernel_a<<<1,1,0,streams[i]>>>();

			kernel_b<<<1,1,0,streams[i]>>>();

			kernel_c<<<1,1,0,streams[i]>>>();

			kernel_d<<<1,1,0,streams[i]>>>();
		}
	}

	// release all stream
	for (int i = 0 ; i < n_streams ; i++)
	{
		cudaStreamDestroy(streams[i]);
	}
	free(streams);

	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	cudaStatus = cudaDeviceReset();

	return 0;
}

Manpowre · May 16, 2013, 8:56am

I actually wonder the same about the Titan.
I have 2x Titans for this specific reason. What I have found out so far is that the executed code needs to be in the same context. Whether this means that you can run 2-3 OpenMP threads, I don't know. But I did test 2 console programs running simultaneously, and that just cut the processing speed in half for each program.

I see the K20x has support for the execution distributor which again should support 2x separate contexts.

If you want, I can run this code tonight when I come home to see the benefit… but I can clearly see that coding this way makes it a lot more complicated.

AlexanderAgathos · May 17, 2013, 5:11pm

#include <omp.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"


#define N 100000


__global__ void kernel_a()
{
	float sum = 0.0;
	for(int i=0; i<N; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_b()
{
	float sum = 0.0;
	for(int i=0; i<N/2; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_c()
{
	float sum = 0.0;
	for(int i=0; i<N/4; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

__global__ void kernel_d()
{
	float sum = 0.0;
	for(int i=0; i<N/8; i++)
	{
		sum = sum + cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1)*cos(0.1);
	}
}

int main()
{

	int n_streams = 32;
	int n_threads = 4;
	cudaError_t cudaStatus;
	
	// Allocate and initialize an array of stream handles
	cudaStream_t *streams = (cudaStream_t *) malloc(n_streams * sizeof(cudaStream_t));
	for (int i = 0 ; i < n_streams ; i++)
	{
		cudaStreamCreate(&(streams[i]));
	}

	// Parallel section
	// "parallel for" must be followed directly by the for loop, not a brace block
	#pragma omp parallel for num_threads(n_threads)
	for (int i=0; i<n_streams; i++)
	{

		kernel_a<<<1,1,0,streams[i]>>>();

		kernel_b<<<1,1,0,streams[i]>>>();

		kernel_c<<<1,1,0,streams[i]>>>();

		kernel_d<<<1,1,0,streams[i]>>>();
	}

	// release all stream
	for (int i = 0 ; i < n_streams ; i++)
	{
		cudaStreamDestroy(streams[i]);
	}
	free(streams);

	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	cudaStatus = cudaDeviceReset();

	return 0;
}

Oh God, I have corrected your code into a civilized OpenMP version. Look at the code.
There is no problem initiating CUDA streams with OpenMP. It is necessary in cases where you want to distribute the data and have the same kernels work on different data. Remember that a queue can execute two kernels concurrently (you have 4). Also, the GTX Titan can handle up to 8 streams at the same time while the K20 can handle 32. There is no technical info on this from NVIDIA, so I think it should be stated here for anyone interested. I have verified it with the simple Hyper-Q example: 8 streams, no more, for the GTX Titan. The GTX Titan is a cheap alternative to the K20, but NVIDIA has made certain that companies and universities with a budget still buy the Tesla GK110.
About the critical question, think about it: do we have a race here? As I told you, the number of streams should be 8 and the number of kernels should be 2. Do your timing and you will see your speedup.
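
For anyone who wants to do that timing, here is a minimal sketch of one way to set it up. It is not the code used in this thread: `spin_kernel` and `time_streams` are hypothetical names, the kernel busy-waits on `clock64()` instead of the dummy `cos` loop so that the compiler cannot discard the work, and the build line assumed is `nvcc -Xcompiler -fopenmp` with GCC as the host compiler. Launches are issued from several OpenMP threads with no `omp critical`, on the assumption that each loop iteration touches only its own stream and that the CUDA runtime calls used here are safe to make from multiple host threads.

```cuda
#include <omp.h>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical busy kernel: spins for about `cycles` GPU clock ticks, then
// writes a flag so the work has a visible side effect.
__global__ void spin_kernel(long long cycles, int *flag)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *flag = 1;
}

// Hypothetical helper: launches `kernels_per_stream` spin kernels on each of
// `n_streams` streams from `n_threads` OpenMP threads. Returns the wall-clock
// time in milliseconds, or -1.0 if the allocation or the final
// synchronization reports an error.
double time_streams(int n_streams, int n_threads, int kernels_per_stream,
                    long long cycles)
{
    int *d_flags = nullptr;
    if (cudaMalloc(&d_flags, n_streams * sizeof(int)) != cudaSuccess)
        return -1.0;

    std::vector<cudaStream_t> streams(n_streams);
    for (int i = 0; i < n_streams; i++)
        cudaStreamCreate(&streams[i]);
    cudaDeviceSynchronize();                    // start from an idle device

    double t0 = omp_get_wtime();
    // No omp critical: each iteration only touches its own stream and flag.
    #pragma omp parallel for num_threads(n_threads)
    for (int i = 0; i < n_streams; i++)
        for (int k = 0; k < kernels_per_stream; k++)
            spin_kernel<<<1, 1, 0, streams[i]>>>(cycles, d_flags + i);
    cudaError_t err = cudaDeviceSynchronize();  // wait for every stream
    double t1 = omp_get_wtime();

    for (int i = 0; i < n_streams; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_flags);
    return (err == cudaSuccess) ? (t1 - t0) * 1000.0 : -1.0;
}
```

Calling `time_streams` for 1, 2, 4, 8, 16 and 32 streams with the same `cycles` value and watching where the time stops being flat is one way to see how many streams actually overlap on a given card. NVIDIA's documentation also describes a `CUDA_DEVICE_MAX_CONNECTIONS` environment variable with a default of 8, so it may be worth varying that when repeating the 8-stream measurement.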

AlexanderAgathos · May 17, 2013, 9:29pm

Also, as a last note on this: I will begin using Hyper-Q now that I have a GTX Titan card. I am greatly disappointed that they reduced the number of concurrent streams by a factor of 4. I will soon have at my disposal a cluster with K20s, so I can test MPI with Hyper-Q. It will be wonderful to launch multiple MPI processes on the same node. So if I have 10 GPUs, I figure I can achieve the result of at least 30 Fermis. Anyway, more in the paper I will write on this issue. ;-)
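
On the question of which cards qualify: Hyper-Q arrived with the GK110 chip, which reports compute capability 3.5, so one quick check is to query the device properties. This is a minimal sketch under that assumption; `reports_hyperq_capable` is a hypothetical helper name, and it only reads fields of `cudaDeviceProp`. As far as I know that structure does not expose the number of hardware queues, so the 8 versus 32 figure still has to be measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few properties of device `dev` and returns
// true if it reports compute capability 3.5 or higher. Returns false if the
// device cannot be queried.
bool reports_hyperq_capable(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;

    printf("%s: compute capability %d.%d, concurrentKernels=%d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);

    return prop.major > 3 || (prop.major == 3 && prop.minor >= 5);
}
```

Under this check a GTX680 (compute capability 3.0) would return false, while a GTX Titan or K20 would return true.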

What I am doing is very interesting. I can beat a K20 with just two M2070s, but this is by issuing one stream. If I issue two or more streams or MPI processes, will the GTX Titan/K20 thrive? The only logical explanation is out-of-order execution from breaking the problem into subproblems and issuing them as separate processes / streams. Anyway, I am saying a lot. I hope to inspire some in the community…

vacaloca · May 19, 2013, 8:55pm

Because the difference is a little hard to see, here it is explicitly:

***** vincentroberge
    // Parallel section
    #pragma omp parallel num_threads(n_threads)
    {
    for (int i=omp_get_thread_num(); i<n_streams; i=i+n_threads)
    {
***** AlexanderAgathos
    // Parallel section
    #pragma omp parallel for num_threads(n_threads)
    for (int i=0; i<n_streams; i++)
    {
*****
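
The two forms are not guaranteed to hand the same streams to the same threads. The manual loop deals streams out round-robin, while a plain `parallel for` leaves the split to the implementation's default schedule, which is often contiguous chunks. Adding `schedule(static, 1)` asks for the round-robin split explicitly. Here is a minimal host-only sketch that records which thread handled each index under both forms. The function names are hypothetical, and it uses `omp_get_num_threads()` as the stride instead of `n_threads` so that no index is skipped if the runtime grants fewer threads than requested.

```cuda
#include <omp.h>
#include <vector>

// Hypothetical helper: owner[i] is the id of the thread that ran iteration i
// when the indices are dealt out by hand, as in the first version.
std::vector<int> owners_manual(int n_iters, int n_threads)
{
    std::vector<int> owner(n_iters, -1);
    #pragma omp parallel num_threads(n_threads)
    {
        int team = omp_get_num_threads();   // actual team size
        for (int i = omp_get_thread_num(); i < n_iters; i += team)
            owner[i] = omp_get_thread_num();
    }
    return owner;
}

// Hypothetical helper: same bookkeeping with a work-sharing loop and an
// explicit round-robin schedule.
std::vector<int> owners_static1(int n_iters, int n_threads)
{
    std::vector<int> owner(n_iters, -1);
    #pragma omp parallel for num_threads(n_threads) schedule(static, 1)
    for (int i = 0; i < n_iters; i++)
        owner[i] = omp_get_thread_num();
    return owner;
}
```

When both regions get the same team size, the two vectors should match. Dropping the `schedule` clause and comparing again shows what the default schedule does on a given compiler.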

AlexanderAgathos · May 21, 2013, 3:35pm

And I have to say that after testing MPI with Hyper-Q on my code, which is considered to be heavy, Hyper-Q really works. It's not a publicity rumor; it really works.