Using CUDA to run many instances
Hi all ...

I just got an overview of CUDA and I'm impressed.
But before spending too much time with this system, I need to know whether CUDA is suited to running multiple instances of one program. My program can't be parallelized, but I have to run many instances with different parameters. Every instance needs about 5 MB of memory, and I need an arbitrary number of instances (the more, the better).

My question:
Is it possible to write a wrapper program which distributes my current program to the graphics chip and then collects the results? How efficient would this be? My program takes (depending on parameters) about 1 hour on a 3 GHz CPU.

Any help is appreciated.
Thanks.
Tommy

#1
Posted 03/31/2012 09:09 AM   
It is possible. I ran multiple programs on the same GPU and it worked. But there is a better option: my problem has large gaps in time during which something else could run, so I used streams to run several copies of the same system in parallel.

#2
Posted 03/31/2012 12:05 PM   
Hi, thank you very much.
Any examples, tutorials, ...?

Kind regards,
Tommy

#3
Posted 03/31/2012 12:08 PM   
Here are some pieces of code showing how I use it. In the code, the particle positions are defined as double3 and each system has Np particles. I define arrays of pointers and allocate them one by one on both the host and the device.

[code]

#define nstr 3 // number of streams I use

int main(void)
{
    double3 *pos[nstr];
    double3 *dev_pos[nstr];
    double *dev_newuuu[nstr], *dev_olduuu;
    double *dev_charge;
    double jump;
    double *h_ene[nstr];
    static int h_acc[1];
    double *dev_energy[nstr], *dev_totalene;
    int *dev_acceptance;
    double3 jxyz[nstr];
    double rnd[nstr];
    int atom_i[nstr];
    double enepene[nstr];

    cudaStream_t stream[nstr]; // streams

    // memory allocations (one set of buffers per stream; host buffers are pinned)
    for (int is = 0; is < nstr; is++)
    {
        cudaStreamCreate(&stream[is]);
        cudaMalloc(&dev_pos[is], sizeof(double3)*Np);
        cudaMalloc(&dev_energy[is], sizeof(double));
        cudaMalloc(&dev_newuuu[is], sizeof(double)*gss);
        cudaHostAlloc(&pos[is], sizeof(double3)*Np, cudaHostAllocDefault);
        cudaHostAlloc(&h_ene[is], sizeof(double), cudaHostAllocDefault);
    }
    cudaMalloc(&dev_acceptance, sizeof(int)); // single acceptance counter shared by all streams

    float gputime;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    init_config(pos, Np, lx, ly, lz, diamsq); // initialize the positions on the host

    // copy the initial configurations to the device
    for (int is = 0; is < nstr; is++)
    {
        cudaMemcpy(dev_pos[is], pos[is], sizeof(double3)*Np, cudaMemcpyHostToDevice);
    }

    // example of how I run the streams in parallel
    for (int ist = 0; ist < nstr; ist++)
    {
        cudaMemcpy(dev_energy[ist], h_ene[ist], sizeof(double), cudaMemcpyHostToDevice);
    }

    cudaEventRecord(start, 0);

    for (int imes = 0; imes < Neq; imes++)
    {
        h_acc[0] = 0;
        cudaMemcpy(dev_acceptance, h_acc, sizeof(int), cudaMemcpyHostToDevice);
        for (int idl = 0; idl < Nout; idl++)
        {
            for (int isp = 0; isp < Np; isp++)
            {
                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    jxyz[ist].x = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].y = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].z = jump*(2.0*genrand64_real2() - 1.0);
                    atom_i[ist] = round((Np - 1)*genrand64_real2());
                    rnd[ist] = genrand64_real2();
                    newMCenergyarray<<<gss, 2*bsl, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], Np, jxyz[ist], atom_i[ist]); // first step
                }
                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    vsu<<<1, 1, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], jxyz[ist], atom_i[ist], dev_acceptance, dev_energy[ist], rnd[ist]); // second step
                }
            }
        }
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&gputime, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Time = %g s\n", gputime/1000.0f);

    return 0;
}
[/code]
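
One caveat about the copies above: cudaMemcpy is a blocking call, so only the kernels launched into the different streams actually overlap here. If you also wanted the transfers to overlap, the pinned buffers from cudaHostAlloc could be used with cudaMemcpyAsync, e.g. (same arguments, plus the stream):

[code]
cudaMemcpyAsync(dev_energy[ist], h_ene[ist], sizeof(double), cudaMemcpyHostToDevice, stream[ist]);
[/code]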

#4
Posted 03/31/2012 12:30 PM   
Thank you ... very instructive, I'll give it a try.

#5
Posted 03/31/2012 03:05 PM   
I just looked on the internet for stream tutorials and ripped off some parts of the code.

#6
Posted 03/31/2012 03:17 PM   
[quote name='tommy000001' date='31 March 2012 - 03:09 AM' timestamp='1333184971' post='1390253']
I just got an overview of CUDA and I'm impressed.
But before spending too much time with this system, I need to know whether CUDA is suited to running multiple instances of one program. My program can't be parallelized, but I have to run many instances with different parameters. Every instance needs about 5 MB of memory, and I need an arbitrary number of instances (the more, the better).

My question:
Is it possible to write a wrapper program which distributes my current program to the graphics chip and then collects the results? How efficient would this be? My program takes (depending on parameters) about 1 hour on a 3 GHz CPU.
[/quote]

While CUDA does allow multiple programs to run on one GPU, the kernels do not run concurrently. The GPU driver switches control of the GPU between client programs between each kernel launch. If your source of parallelism is running multiple instances, CUDA will not accelerate this at all.

#7
Posted 04/01/2012 11:43 AM   
[quote name='seibert' date='01 April 2012 - 12:43 PM' timestamp='1333280609' post='1390667']
While CUDA does allow multiple programs to run on one GPU, the kernels do not run concurrently. The GPU driver switches control of the GPU between client programs between each kernel launch. If your source of parallelism is running multiple instances, CUDA will not accelerate this at all.
[/quote]
So using streams is the only way to gather more statistics for programs which would not fill the GPU? At least for my problem it worked to some extent.

#8
Posted 04/01/2012 12:17 PM   
[quote name='pasoleatis' date='01 April 2012 - 06:17 AM' timestamp='1333282621' post='1390674']
So using streams is the only way to gather more statistics for programs which would not fill the GPU? At least for my problem it worked to some extent.
[/quote]

Yes, multiple streams in the same process can launch concurrent kernels (on Fermi and later, anyway). Different processes have to time slice the entire GPU. Moreover, the multitasking is cooperative, not preemptive, so the GPU can only switch processes between operations.

The reason I wanted to jump in here is that it sounds like the original poster might think that the GPU is like a multicore CPU, and that parallelism can be achieved by launching many independent serial processes. That absolutely does not work on CUDA devices, unfortunately. The parallelism has to be found within the same process, and the goal is to exploit data parallelism rather than task parallelism.

That said, the compute problem described could be amenable to a data-parallel calculation, I just wanted to throw some caution in so no one is surprised later. :)
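
To make that concrete, here is a minimal sketch of what a data-parallel version might look like. All names and the toy per-instance computation are made up for illustration, not taken from the original program: each thread runs one instance with its own parameter set.

[code]
#include <stdio.h>
#include <stdlib.h>

// Stand-in for the per-instance computation (hypothetical; the real
// serial program body would go here as a __device__ function).
__device__ float run_instance(float param)
{
    float x = param;
    for (int k = 0; k < 1000; k++)
        x = 0.5f * (x + param / x); // toy iteration (converges to sqrt(param))
    return x;
}

// One thread per instance: the "many independent instances" become
// data parallelism over an array of parameter sets.
__global__ void run_all_instances(const float *params, float *results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = run_instance(params[i]);
}

int main(void)
{
    const int n = 4096; // number of instances
    size_t bytes = n * sizeof(float);

    float *h_params = (float*)malloc(bytes);
    float *h_results = (float*)malloc(bytes);
    for (int i = 0; i < n; i++)
        h_params[i] = 1.0f + i; // one parameter set per instance

    float *d_params, *d_results;
    cudaMalloc(&d_params, bytes);
    cudaMalloc(&d_results, bytes);
    cudaMemcpy(d_params, h_params, bytes, cudaMemcpyHostToDevice);

    run_all_instances<<<(n + 255)/256, 256>>>(d_params, d_results, n);

    cudaMemcpy(h_results, d_results, bytes, cudaMemcpyDeviceToHost);
    printf("instance 0 -> %f\n", h_results[0]);

    cudaFree(d_params);
    cudaFree(d_results);
    free(h_params);
    free(h_results);
    return 0;
}
[/code]

Whether this mapping works depends on the per-instance code: with ~5 MB of state per instance, memory capacity and access patterns become the real constraints, and heavily divergent per-instance code may map better to one block per instance than to one thread.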

#9
Posted 04/01/2012 03:07 PM   
Yes. I have been asked about this before: people asking me if they could just run 448 processes at the same time, as if each CUDA core were a CPU core.

#10
Posted 04/01/2012 03:25 PM   
Thanks for your replies ...

[quote]The parallelism has to be found within the same process, and the goal is to exploit data parallelism rather than task parallelism.[/quote]

Unfortunately, I thought it could be used like a many-core CPU; too good to be true.

Kind regards.

#11
Posted 04/01/2012 07:31 PM   