Many threads updating a single global variable
I am new to CUDA. In my program, multiple threads have to update a global variable, count, which is initially 0. Each thread takes a value, and if that value is not equal to zero, the thread should increment count by 1. I tried this in my code. For an array containing 1665 non-zero elements, I get results such as 38, 39, or 31, and the value differs on every run. How do I synchronize the writes by the threads to the same location? Thank you.

#1
Posted 03/28/2012 01:52 PM   
Use atomics, for example atomicInc. Check the programming guide for reference.
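
A minimal sketch of the atomic approach, assuming one thread per array element. The names `countNonZeroAtomic` and `countNonZeroWithAtomics` are made up for illustration and do not come from the original poster's code. It uses atomicAdd rather than atomicInc, because atomicInc wraps back to zero once the counter reaches its limit argument, while atomicAdd(count, 1) is a plain increment. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread inspects one element and atomically bumps the shared counter
// when the element is non-zero. A plain (*count)++ here would be a data race,
// which is why the unsynchronized version returns a different value each run.
__global__ void countNonZeroAtomic(const int *values, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] != 0) {
        atomicAdd(count, 1);
    }
}

// Hypothetical host wrapper: copies the data to the device, launches the
// kernel, and returns the final count.
int countNonZeroWithAtomics(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int *dValues = nullptr;
    int *dCount = nullptr;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dValues, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));  // the counter must start at 0

    int threads = 256;                   // assumed block size
    int blocks = (n + threads - 1) / threads;
    countNonZeroAtomic<<<blocks, threads>>>(dValues, n, dCount);

    int result = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dCount);
    return result;
}
```

Called on a host vector holding 1665 non-zero entries, `countNonZeroWithAtomics` should return 1665 on every run.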

#2
Posted 03/28/2012 02:25 PM   
Using atomicInc in every thread will serialize your execution, and it will run *very* slowly. Use a reduction instead. Thrust has reduction routines already written for you.
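
For a count of non-zero elements, one way to lean on Thrust might be `thrust::count_if` with a small predicate. This is only a sketch: the functor `IsNonZero` and the wrapper `countNonZeroWithThrust` are hypothetical names, and it assumes the data starts out in a host `std::vector<int>`.

```cuda
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <vector>

// Predicate evaluated on the device for each element.
struct IsNonZero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical host wrapper: copies the input to the GPU and lets Thrust
// perform the parallel count.
int countNonZeroWithThrust(const std::vector<int> &host)
{
    thrust::device_vector<int> d(host.begin(), host.end());
    return static_cast<int>(thrust::count_if(d.begin(), d.end(), IsNonZero()));
}
```

No hand-written kernel or explicit synchronization is needed here; Thrust picks the launch configuration itself.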

#3
Posted 03/30/2012 12:15 PM   
Both approaches make sense depending on how many threads will need to increment the global counter. If only a small fraction of threads will increment the counter, atomicAdd() is quick and easy. If a large number of threads need to increment the counter, then a reduction-based algorithm, like DrAnderson suggests, is a good approach.

(One thing I have not ever tried is a block-level reduction in shared memory, followed by one atomicAdd() per block on the global counter.)
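
The idea in the parenthetical could look roughly like the sketch below: each block sums its 0/1 flags in shared memory, then thread 0 issues a single atomicAdd. The names `countNonZeroBlockReduce`, `countNonZeroPerBlock`, and `kThreads` are invented for illustration, and the block size is assumed to be a power of two. This has not been benchmarked against the other approaches.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

__global__ void countNonZeroBlockReduce(const int *values, int n, int *count)
{
    __shared__ int partial[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread contributes 1 for a non-zero element and 0 otherwise.
    partial[tid] = (i < n && values[i] != 0) ? 1 : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the active range each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // One global atomic per block instead of one per thread.
    if (tid == 0 && partial[0] != 0) {
        atomicAdd(count, partial[0]);
    }
}

// Hypothetical host wrapper; error checking omitted for brevity.
int countNonZeroPerBlock(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int *dValues = nullptr;
    int *dCount = nullptr;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dValues, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    int blocks = (n + kThreads - 1) / kThreads;
    countNonZeroBlockReduce<<<blocks, kThreads>>>(dValues, n, dCount);

    int result = 0;
    cudaMemcpy(&result, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dCount);
    return result;
}
```

With 256 threads per block, an array of 2000 elements would issue at most 8 global atomics instead of up to 2000.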

#4
Posted 03/30/2012 12:58 PM   
[quote name='DrAnderson42' date='30 March 2012 - 02:15 PM' timestamp='1333109734' post='1389888']
Use a reduction instead. thrust has reduction routines already programmed for you.
[/quote]

Do you have some examples of this, or some links to check? I am also interested in this idea of reduction routines.

#5
Posted 03/30/2012 01:05 PM   
A nice introductory example of reduction can be found in Mark Harris's talk from SC2007:

http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

There is a lot of stuff at the beginning of the slides about CUDA optimization, but Example #2 in the talk is a description of parallel reduction and how to implement it for CUDA.

#6
Posted 03/30/2012 01:15 PM   
Indeed there is a lot of interesting stuff. Thank you very much.

Now I need to assimilate parallel reduction, but I guess I won't finish today ;)

#7
Posted 03/30/2012 02:05 PM   
[quote name='seibert' date='30 March 2012 - 01:58 PM' timestamp='1333112296' post='1389903']
Both approaches make sense depending on how many threads will need to increment the global counter. If only a small fraction of threads will increment the counter, atomicAdd() is quick and easy. If a large number of threads need to increment the counter, then a reduction-based algorithm, like DrAnderson suggests, is a good approach.

(One thing I have not ever tried is a block-level reduction in shared memory, followed by one atomicAdd() per block on the global counter.)
[/quote]

I recently tried something similar and saw some noticeable improvement, but it seemed to vary with my access patterns, which were quite unpredictable. Logically it made sense to reduce everything at the block level (from a bandwidth perspective) and then reduce at the global level using atomics... Whenever your access patterns begin to be data-dependent, you are on a slippery slope with regard to data locality :/

#8
Posted 03/30/2012 03:30 PM   