[question] Simulation of strings, independent processes
Hello everybody,

I hope that the post is clear enough; English is not my mother tongue.
I have a question for the community, but before asking it I should probably introduce my problem (in simple terms).
In my work I need to simulate the pulling of a string that has some parameters. The string starts from a random
conformation and, as force is applied, reaches a straight conformation. I do this with a Monte Carlo simulation, and
the code runs on the CPU (it takes several hours for just a single simulation using one core).

The point is the following: I need to simulate many strings (on the order of 200) that have slightly different
properties, but I do not have access to a cluster of CPUs.

Each simulation is -->independent<-- of the others, so the cores/execution units do not have to talk to or synchronize
with each other, and the simulations do not require much memory either (500 kB per simulation is enough).
The code is roughly 200 lines; it is not extremely complicated, but it uses several "for" loops, "if" statements,
and mathematical functions, plus random number generators.

I thought that, even if the GPU is slower at a single simulation, I could run several simulations in parallel,
which should mean a gain in time. Am I wrong?

What do you think, is it worth porting to the GPU?

Any opinion is welcome.

#1
Posted 04/25/2012 12:18 PM   
200 independent simulations are not many from the GPU perspective - GPUs prefer to have tens to hundreds of thousands of threads running. So you will have to extract more parallelism from each individual simulation. This should not be too hard, however: parallelizing the outermost for loop will likely be good enough.
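As a very rough sketch of what that structure could look like in CUDA (all names here are hypothetical, not the poster's actual code): one block per string, one thread per chain segment, so the 200 simulations together supply thousands of threads.

```cuda
// Sketch only (hypothetical names): one block per string,
// one thread per segment of that string.
struct StringParams { float stiffness; float temperature; };

__global__ void pull_strings(const StringParams *params, float *extension,
                             int n_steps)
{
    int sim = blockIdx.x;   // index of this string (0..n_strings-1)
    int seg = threadIdx.x;  // index of a segment within the string

    float local = 0.0f;
    for (int step = 0; step < n_steps; ++step) {
        // ... propose a move for segment `seg`, compute DeltaE,
        //     accept/reject, update `local` ...
        __syncthreads();    // keep all segments of one string in lockstep
    }
    if (seg == 0)
        extension[sim] = local;  // one result per string
}

// Host side: 200 strings x 64 segments in a single launch:
//   pull_strings<<<200, 64>>>(d_params, d_extension, n_steps);
```

With 200 blocks of 64 threads that is 12,800 threads in flight, which is enough to keep a GPU of that era reasonably busy.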

A question just out of interest, if I may: Why are you using a Monte Carlo simulation? Aren't there faster ways of simulating dissipation? Is the element of randomness needed to get the right sound in an ensemble of instruments?

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.
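The "check return codes" advice is commonly implemented with a small wrapper macro, for example (one common pattern, not the only one):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so errors are reported where they occur.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                          \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   my_kernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
```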

#2
Posted 04/25/2012 01:09 PM   
Hello,

I am running an MC simulation for an N-body problem. At each MC step I move one particle. For the new energy, N interactions are calculated, and this is done on the GPU. Because my system is not so large, it does not fill the GPU; but because I need many measurements, I use streams to simulate independent configurations and get more statistics.
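That stream pattern might look roughly like the following sketch (the kernel and data names are placeholders, not the poster's actual code):

```cuda
// Sketch (hypothetical kernel and per-configuration state): launch several
// independent MC configurations into separate streams so their small
// kernels can overlap and fill the GPU better than one alone.
const int n_streams = 4;
cudaStream_t streams[n_streams];
for (int i = 0; i < n_streams; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < n_streams; ++i) {
    // d_config[i] and mc_step_kernel stand in for the real
    // per-configuration state and the MC update kernel.
    mc_step_kernel<<<grid, block, 0, streams[i]>>>(d_config[i]);
}

cudaDeviceSynchronize();  // wait for all streams to finish
for (int i = 0; i < n_streams; ++i)
    cudaStreamDestroy(streams[i]);
```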

#3
Posted 04/25/2012 02:05 PM   
[quote name='tera' date='25 April 2012 - 02:09 PM' timestamp='1335359376' post='1400807']
200 independent simulations are not many from the GPU perspective - they prefer having tens to hundredths of thousands threads running. So you will have to extract more parallelism from each individual simulation. This should however not be too hard, parallelizing the outermost for loop will likely be good enough.
[/quote]

You are saying that it is not worth it unless I squeeze all the possible parallelism out of the GPU, did I get you correctly?

[quote name='tera' date='25 April 2012 - 02:09 PM' timestamp='1335359376' post='1400807']
A question just out of interest, if I may: Why are you using a Monte Carlo simulation? Aren't there faster ways of simulating dissipation? Is the element of randomness needed to get the right sound in an ensemble of instruments?
[/quote]

About the simulation: I will try to answer your question, although I don't know if I will be able to give you the answer that you want.
The idea is that at each timestep you apply a force F(t), and this force increases at each timestep.
At the same time the chain attempts a random change of conformation (well, not completely random; let's say that a segment of the chain can change direction in space).
The chain is in fact a polymer at a certain temperature, so each segment moves (and the chain accordingly), with some constraints.
The change of energy (DeltaE) is calculated considering the applied force and the change in extension of the chain, and this DeltaE is compared
to a random number for the acceptance step.

#4
Posted 04/25/2012 02:06 PM   
If you can parallelize at least enough to get a few blocks, you can fill the GPU. It is not going to achieve the maximum performance, but for me it gave enough speedup to use it for production runs.

#5
Posted 04/25/2012 02:09 PM   
[quote name='Fabrizio' date='25 April 2012 - 03:06 PM' timestamp='1335362791' post='1400826']
You are saying that it is not worth it unless I squeeze all the possible parallelism out of the GPU, did I get you correctly?
[/quote]
You don't need to squeeze out all the parallelism, but you likely need more than just the ~200 independent simulations. The prospects are quite bright, though: using one block of 64 to 1024 threads for each simulation should be easy (threads within a block can fully communicate with each other and run in sync), and then you are right at the number of threads you need.

[quote name='Fabrizio' date='25 April 2012 - 03:06 PM' timestamp='1335362791' post='1400826']
About the simulation: I will try to answer your question, although I don't know if I will be able to give you the answer that you want.
The idea is that at each timestep you apply a force F(t), and this force increases at each timestep.
At the same time the chain attempts a random change of conformation (well, not completely random; let's say that a segment of the chain can change direction in space).
The chain is in fact a polymer at a certain temperature, so each segment moves (and the chain accordingly), with some constraints.
The change of energy (DeltaE) is calculated considering the applied force and the change in extension of the chain, and this DeltaE is compared
to a random number for the acceptance step.
[/quote]
Thanks for the explanation!


#6
Posted 04/25/2012 10:56 PM   