Execute different kinds of code in one MP
Hello,

I want to port a Genetic Programming algorithm to CUDA. I have several "species" of algorithms that should co-evolve (and have different goals); to find out the fitness of candidate algorithms, one of each of these is assigned to a test run. Several test runs are simulated with random data; then, the fitness of each algorithm can be evaluated.
The fittest algorithms are promoted; they are copied everywhere, and some mutations are introduced. The process is repeated.

My naive approach is to assign one test run to each multiprocessor. One candidate of each species lives in the multiprocessor; each streaming processor simulates one of these candidates.

Now, I have heard that executing conditionals can be bad, because in some cases both branches have to be executed. Is that also a problem in the long term? I.e., if I have a branch at the beginning of my kernel that depends on the thread number (so that one thread can be species X and another thread species Y), will the code of both branches be executed indefinitely, or does this only happen for a few clock cycles?

Regardless of that: would you approach my basic plan differently?

Thanks in advance.

#1
Posted 03/27/2012 02:11 AM   
I'd assume you test the algorithms on a few different sets of data. If so, have each block (or at least each warp) run the same algorithm on a number of different data sets.

Conditionals where both branches are taken are indeed expensive (the two branches are executed one after the other, each for its full duration, not just a few cycles). You may want to read the chapters about the programming model and its hardware implementation in the Programming Guide.
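To illustrate the difference, here is a minimal sketch (`speciesX`/`speciesY` are hypothetical device functions standing in for your interpreters): if the branch splits threads *within* a warp, the hardware serializes both paths for the lifetime of the branch; if the branch falls on warp boundaries, each warp takes only one path and nothing is serialized.

```cuda
// Hypothetical per-species simulation routines.
__device__ void speciesX(int tid) { /* ... simulate species X ... */ }
__device__ void speciesY(int tid) { /* ... simulate species Y ... */ }

// Bad: even and odd threads sit in the same warp, so the warp executes
// speciesX and then speciesY one after the other, for the whole kernel.
__global__ void divergentKernel()
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        speciesX(tid);
    else
        speciesY(tid);
}

// Better: branch on the warp index. All 32 threads of a warp take the
// same path, so the branch costs almost nothing.
__global__ void convergentKernel()
{
    int tid  = threadIdx.x;
    int warp = tid / 32;   // warpSize is 32 on all current hardware
    if (warp % 2 == 0)
        speciesX(tid);
    else
        speciesY(tid);
}
```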

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
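The error-checking advice above is commonly implemented with a wrapper macro; a sketch (this is a widespread convention, not an official API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported with file/line.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Typical usage:
//   CUDA_CHECK(cudaMalloc(&devPtr, bytes));
//   myKernel<<<grid, block>>>(devPtr);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches async errors,
//                                         // including watchdog kills
```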

#2
Posted 03/27/2012 09:59 AM   
[quote name='tera' date='27 March 2012 - 10:59 AM' timestamp='1332842397' post='1388478']
I'd assume you test the algorithms on a few different sets of data. If so, have each block (or at least each warp) run the same algorithm on a number of different data sets.

Conditionals where both branches are indeed expensive (the two branches are executed one after the other, for their full duration and not just a few cycles). You may want to read up the chapters about the programming model and its hardware implementation in the Programming Guide.
[/quote]

Thank you! Unfortunately, that would be suboptimal. In one test run, all of these species have to exchange data, so I would like that data exchange to be quick and the global bus to be uncrowded. Any suggestions?

#3
Posted 03/27/2012 10:30 AM   
Try to understand the fundamental programming model before worrying about the details.

You can have different species in one block, as long as every warp (32 threads) runs the same code path. Global communication always requires going through global memory in CUDA.
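A sketch of that layout, under the assumption of one block per test run and one warp per species (`runSpecies` and `NUM_SPECIES` are hypothetical placeholders; the block would be launched with `NUM_SPECIES * 32` threads). Within a block, the species can exchange data through fast shared memory, and `__syncthreads()` is safe here because it is outside any divergent condition:

```cuda
#define NUM_SPECIES 4   // hypothetical; launch blocks of NUM_SPECIES * 32 threads

// Hypothetical interpreter for one species' candidate program.
__device__ float runSpecies(int species, int lane, const float *exchange)
{
    /* ... evaluate this species' candidate for one step ... */
    return 0.0f;  // placeholder output
}

__global__ void testRunKernel(float *fitnessOut)
{
    // Per-block scratch area for inter-species data exchange.
    __shared__ float exchange[NUM_SPECIES * 32];

    int warp = threadIdx.x / 32;  // which species this warp simulates
    int lane = threadIdx.x % 32;  // thread index within the warp

    for (int step = 0; step < 100; ++step) {
        // Each warp follows a single code path: no divergence penalty.
        float out = runSpecies(warp, lane, exchange);
        __syncthreads();                    // reached by all threads alike
        exchange[warp * 32 + lane] = out;   // publish this step's results
        __syncthreads();                    // everyone sees the new data
    }

    if (lane == 0)
        fitnessOut[blockIdx.x * NUM_SPECIES + warp] = exchange[warp * 32];
}
```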

#4
Posted 03/27/2012 10:56 AM   
[quote name='tera' date='27 March 2012 - 11:56 AM' timestamp='1332845776' post='1388500']
Try to understand the fundamental programming model before worrying about the details.

You can have different species in one block, as long as every warp (32 threads) runs the same code path. Global communication always requires going global memory in CUDA.
[/quote]

So one block can be divided across several multiprocessors? I read [url="http://www.pgroup.com/lit/articles/insider/v2n1a5.htm"]here[/url] that one MP in a Tesla card has 8 stream processors, which, as I understand, means that one multiprocessor only has space for one warp. The thing I wanted to do is take advantage of the shared memory in one multiprocessor; because most of the data flow is localized within one test run, this should give me a speedup compared with all MPs having to use the same global memory bus. Is that possible/useful?

#5
Posted 03/27/2012 02:27 PM   
Don't rely on random sources from the internet. Read the [url="http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf"]CUDA C Programming Guide[/url] from Nvidia instead, which is the definitive source. At least read the short chapters 2 and 4 before continuing to think about your implementation, because CUDA is quite different from what you are describing above.

#6
Posted 03/28/2012 03:08 PM   
With current, official documentation, things start becoming clear. Thanks.

#7
Posted 04/01/2012 05:49 PM   