Imagine you have a number of threads each responsible for the simulation of a distinct particle( 3D with position(x,y,z) and direction(l,m,n) with some other info).
There are regions in this simulation world where these particles can generate secondary particles based on some random process. Some 3D regions are more probable to generate secondaries than others, and once it has been determined that there has been an additional particle generated, I need to save a bunch of data related to that generation at that point in the simulated process.
My previous version allowed only for the generation of a single secondary particle, so each thread either generated 0 or 1 secondary particles. At the end of simulation I would save data for both the primary and secondary particles, with a ‘dummy’ value for the secondary if one was not generated.
Now I need to implement a version where there can be (theoretically) up to 60,000 secondary particle events generated from a primary during the course of each particles simulation. A primary particle could be in a ‘volatile’ region where each movement in that region has a high probability of generating a secondary event.
Once a secondary particle event has been generated, I do not need to simulate that secondary further, rather just save a bunch of information about the state of things at that exact point in the simulation.
So the bottom line is that I have no realistic upper limit on the number of secondaries, and cannot use pre-allocated registers or shared memory for local storage and ordering before writing out the global memory.
The other GPU versions of this type of simulation I have seen use a single 32 bit integer global value which is updated via atomic operations. This atomic operation would return an integer value which is the next available index in global memory for the secondary particle event information to be stored. A thread in that implementation which generated a secondary particle event would have to diverge from the rest of threads in a block, update an global counter,save about 120 bytes of information to global memory, then continue the simulation.
Such implementations are slow and I am trying to create a more efficient implementation.
Overall the probability that a large number of secondary events will be generated is small, but I was told not to rely on that fact and assume each thread(primary particle) could generate a secondary particle each step in the simulation.
Since each thread’s primary particle in a thread block could be in very different regions of the 3D world, it is unlikely that multiple threads in a block/warp will simultaneously need to store information related to their generated secondary particles at the same point in the simulation process.
A toolkit like Geant4 (CPU based) simulates particles serially one by one, and if a secondary particle is generated it saves that info and later simulates that secondary particle in a subsequent run.
My current CUDA application operates much differently so that approach is not possible. Already I am close to my resource limit per thread block (shared memory and registers), so more than two potential secondary particles will be a challenge with the current structure.
Ultimately I can use global counters and disparate global memory updates as well, but would like to test other potentially more efficient methods first.
Anyone have any ideas or links to papers which address this problem in a parallel architecture?
Thanks.