Using CUDA to run many instances
Hi all ...

I just got an overview of CUDA and I'm impressed.
But before spending too much time with this system, I need to know whether CUDA is suited to running multiple instances of one program. My program can't be parallelized, but I have to run many instances with different parameters. Every instance needs about 5 MB of memory, and I need an arbitrary number of instances (the more, the better).

My question:
Is it possible to write a wrapper program which distributes my current program to the graphics chip and then collects the results? How efficient would this be? My program takes (depending on parameters) about 1 hour on a 3 GHz CPU.

Any help is appreciated.
Thanks.
Tommy

#1
Posted 03/31/2012 09:09 AM   
It is possible. I ran multiple programs on the same GPU and it worked. But there is a better option: my problem has large gaps in time during which something else could run, so I used streams to run several copies of the same system in parallel.

#2
Posted 03/31/2012 12:05 PM   
Hi, thank you very much.
Any examples, tutorials, ...?

Kind regards,
Tommy

#3
Posted 03/31/2012 12:08 PM   
Here are some pieces of code showing how I use it. In the code, the particle positions are defined as double3 and each system has Np particles. I define arrays of pointers and allocate them one by one on both the host and the device.

[code]

#define nstr 3 // number of streams I use

int main(void)
{
    double3 *pos[nstr];
    double3 *dev_pos[nstr];
    double *dev_newuuu[nstr], *dev_olduuu;
    double *dev_charge;
    double jump;
    double *h_ene[nstr];
    static int h_acc[1];
    double *dev_energy[nstr], *dev_totalene;
    int *dev_acceptance;
    double3 jxyz[nstr];
    double rnd[nstr];
    int atom_i[nstr];
    double enepene[nstr];

    cudaStream_t stream[nstr]; // streams

    // memory allocations (one set of buffers per stream; host buffers are pinned)
    for (int is = 0; is < nstr; is++)
    {
        cudaStreamCreate(&stream[is]);
        cudaMalloc(&dev_pos[is], sizeof(double3)*Np);
        cudaMalloc(&dev_energy[is], sizeof(double));
        cudaMalloc(&dev_newuuu[is], sizeof(double)*gss);
        cudaHostAlloc(&pos[is], sizeof(double3)*Np, cudaHostAllocDefault);
        cudaHostAlloc(&h_ene[is], sizeof(double), cudaHostAllocDefault);
    }
    cudaMalloc(&dev_acceptance, sizeof(int)); // single acceptance counter shared by all streams

    float gputime;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    init_config(pos, Np, lx, ly, lz, diamsq); // initialize the positions on the host

    // copy the initial configurations to the device
    for (int is = 0; is < nstr; is++)
    {
        cudaMemcpy(dev_pos[is], pos[is], sizeof(double3)*Np, cudaMemcpyHostToDevice);
    }

    // example of how I run the streams in parallel
    for (int ist = 0; ist < nstr; ist++)
    {
        cudaMemcpy(dev_energy[ist], h_ene[ist], sizeof(double), cudaMemcpyHostToDevice);
    }

    cudaEventRecord(start, 0);

    for (int imes = 0; imes < Neq; imes++)
    {
        h_acc[0] = 0;
        cudaMemcpy(dev_acceptance, h_acc, sizeof(int), cudaMemcpyHostToDevice);
        for (int idl = 0; idl < Nout; idl++)
        {
            for (int isp = 0; isp < Np; isp++)
            {
                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    jxyz[ist].x = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].y = jump*(2.0*genrand64_real2() - 1.0);
                    jxyz[ist].z = jump*(2.0*genrand64_real2() - 1.0);
                    atom_i[ist] = round((Np - 1)*genrand64_real2());
                    rnd[ist] = genrand64_real2();
                    newMCenergyarray<<<gss, 2*bsl, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], Np, jxyz[ist], atom_i[ist]); // first step
                }
                for (int ist = 0; ist < nstr; ist++) // here calling the same function for different streams
                {
                    vsu<<<1, 1, 0, stream[ist]>>>(dev_pos[ist], dev_newuuu[ist], jxyz[ist], atom_i[ist], dev_acceptance, dev_energy[ist], rnd[ist]); // second step
                }
            }
        }
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&gputime, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Time = %g s\n", gputime/1000.0f);

    return 0;
}
[/code]
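
One caveat about the copies above: cudaMemcpy is a blocking call, so only the kernels launched into the different streams actually overlap here. If you also wanted the transfers to overlap, the pinned buffers from cudaHostAlloc could be used with cudaMemcpyAsync, e.g. (same arguments, plus the stream):

[code]
cudaMemcpyAsync(dev_energy[ist], h_ene[ist], sizeof(double), cudaMemcpyHostToDevice, stream[ist]);
[/code]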

#4
Posted 03/31/2012 12:30 PM   
Thank you ... very instructive, I'll give it a try.

#5
Posted 03/31/2012 03:05 PM   
I just looked on the internet for stream tutorials and ripped off some parts of the code.

#6
Posted 03/31/2012 03:17 PM   
[quote name='tommy000001' date='31 March 2012 - 03:09 AM' timestamp='1333184971' post='1390253']
I just got an overview of CUDA and I'm impressed.
But before spending too much time with this system, I need to know whether CUDA is suited to running multiple instances of one program. My program can't be parallelized, but I have to run many instances with different parameters. Every instance needs about 5 MB of memory, and I need an arbitrary number of instances (the more, the better).

My question:
Is it possible to write a wrapper program which distributes my current program to the graphics chip and then collects the results? How efficient would this be? My program takes (depending on parameters) about 1 hour on a 3 GHz CPU.
[/quote]

While CUDA does allow multiple programs to run on one GPU, the kernels do not run concurrently. The GPU driver switches control of the GPU between client programs between each kernel launch. If your source of parallelism is running multiple instances, CUDA will not accelerate this at all.

#7
Posted 04/01/2012 11:43 AM   
[quote name='seibert' date='01 April 2012 - 12:43 PM' timestamp='1333280609' post='1390667']
While CUDA does allow multiple programs to run on one GPU, the kernels do not run concurrently. The GPU driver switches control of the GPU between client programs between each kernel launch. If your source of parallelism is running multiple instances, CUDA will not accelerate this at all.
[/quote]
So using streams is the only way to gather more statistics for programs which would not fill the GPU? At least for my problem it worked to some extent.

#8
Posted 04/01/2012 12:17 PM   
[quote name='pasoleatis' date='01 April 2012 - 06:17 AM' timestamp='1333282621' post='1390674']
So using streams is the only way to gather more statistics for programs which would not fill the GPU? At least for my problem it worked to some extent.
[/quote]

Yes, multiple streams in the same process can launch concurrent kernels (on Fermi and later, anyway). Different processes have to time slice the entire GPU. Moreover, the multitasking is cooperative, not preemptive, so the GPU can only switch processes between operations.

The reason I wanted to jump in here is that it sounds like the original poster might think that the GPU is like a multicore CPU, and that parallelism can be achieved by launching many independent serial processes. That absolutely does not work on CUDA devices, unfortunately. The parallelism has to be found within the same process, and the goal is to exploit data parallelism rather than task parallelism.

That said, the compute problem described could be amenable to a data-parallel calculation, I just wanted to throw some caution in so no one is surprised later. :)
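
To make that concrete, here is a minimal sketch of what a data-parallel version might look like. All names and the toy per-instance computation are made up for illustration, not taken from the original program: each thread runs one instance with its own parameter set.

[code]
#include <stdio.h>
#include <stdlib.h>

// Stand-in for the per-instance computation (hypothetical; the real
// serial program body would go here as a __device__ function).
__device__ float run_instance(float param)
{
    float x = param;
    for (int k = 0; k < 1000; k++)
        x = 0.5f * (x + param / x); // toy iteration (converges to sqrt(param))
    return x;
}

// One thread per instance: the "many independent instances" become
// data parallelism over an array of parameter sets.
__global__ void run_all_instances(const float *params, float *results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = run_instance(params[i]);
}

int main(void)
{
    const int n = 4096; // number of instances
    size_t bytes = n * sizeof(float);

    float *h_params = (float*)malloc(bytes);
    float *h_results = (float*)malloc(bytes);
    for (int i = 0; i < n; i++)
        h_params[i] = 1.0f + i; // one parameter set per instance

    float *d_params, *d_results;
    cudaMalloc(&d_params, bytes);
    cudaMalloc(&d_results, bytes);
    cudaMemcpy(d_params, h_params, bytes, cudaMemcpyHostToDevice);

    run_all_instances<<<(n + 255)/256, 256>>>(d_params, d_results, n);

    cudaMemcpy(h_results, d_results, bytes, cudaMemcpyDeviceToHost);
    printf("instance 0 -> %f\n", h_results[0]);

    cudaFree(d_params);
    cudaFree(d_results);
    free(h_params);
    free(h_results);
    return 0;
}
[/code]

Whether this mapping works depends on the per-instance code: with ~5 MB of state per instance, memory capacity and access patterns become the real constraints, and heavily divergent per-instance code may map better to one block per instance than to one thread.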

#9
Posted 04/01/2012 03:07 PM   
Yes. I have been asked about this before: people asking me if they could just run 448 processes at the same time, as if each CUDA core were a CPU core.

#10
Posted 04/01/2012 03:25 PM   
Thanks for your replies ...

[quote]The parallelism has to be found within the same process, and the goal is to exploit data parallelism rather than task parallelism.[/quote]

Unfortunately, I thought it could be used like a many-core CPU; too good to be true.

Kind regards.

#11
Posted 04/01/2012 07:31 PM   