Many threads updating a single global variable

I am new to CUDA. In my program, multiple threads have to update a global variable, count, which is initially 0. Each thread takes a value, and if that value is not equal to zero, the thread should increment count by 1. I tried this in my code. For an array containing 1665 non-zero elements, I get results such as 38, 39, 31, etc. The value differs on every run. How do I synchronize the writes by threads to the same location? Thank you.

Use atomics, for example atomicInc. Check the programming guide for reference.
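For anyone finding this thread later, here is a minimal sketch of the atomic approach. It is not the original poster's code (which was not shown); the kernel name, the host wrapper, and the block size of 256 are all assumptions made for illustration. It uses atomicAdd() rather than atomicInc(), because atomicInc() takes a wrap-around limit as its second argument and resets the counter to zero once that limit is reached. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread looks at one element and, if it is
// non-zero, atomically adds 1 to the global counter.
__global__ void countNonZeroAtomicKernel(const int *data, int n, unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 0) {
        // The read-modify-write happens as one indivisible operation,
        // so concurrent increments are not lost.
        atomicAdd(count, 1u);
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
unsigned int countNonZeroWithAtomics(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int *dData = nullptr;
    unsigned int *dCount = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));  // counter starts at 0

    int threads = 256;                            // assumed block size
    int blocks = (n + threads - 1) / threads;
    countNonZeroAtomicKernel<<<blocks, threads>>>(dData, n, dCount);

    unsigned int result = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dCount);
    return result;
}
```

A plain `(*count)++` in the kernel would compile to a separate load, add, and store, which is why most of the increments were being overwritten in the original program.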

Using atomicInc in every thread will serialize the updates to that one address, so it will run very slowly. Use a reduction instead. Thrust has reduction routines already written for you.
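As a rough sketch of the Thrust route, counting the non-zero elements could look like the following. The functor name IsNonZero and the wrapper countNonZeroWithThrust are made up for this example; thrust::count_if is one of several Thrust routines that could be used here, and the choice is an assumption, not something specified above.

```cuda
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <vector>

// Hypothetical predicate: true for every non-zero element.
struct IsNonZero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical host wrapper: copies the data to the device and lets
// Thrust count the elements that satisfy the predicate.
int countNonZeroWithThrust(const std::vector<int> &host)
{
    thrust::device_vector<int> d(host.begin(), host.end());
    return static_cast<int>(thrust::count_if(d.begin(), d.end(), IsNonZero()));
}
```

No kernel or explicit atomic appears in user code here; Thrust picks the launch configuration and the counting strategy internally.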

Both approaches make sense depending on how many threads will need to increment the global counter. If only a small fraction of threads will increment the counter, atomicAdd() is quick and easy. If a large number of threads need to increment the counter, then a reduction-based algorithm, like DrAnderson suggests, is a good approach.

(One thing I have never tried is a block-level reduction in shared memory, followed by one atomicAdd() per block on the global counter.)
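A minimal sketch of that hybrid idea, for illustration only: each block sums its per-thread flags in shared memory, then thread 0 issues a single atomicAdd() for the whole block. The names, the fixed block size kThreads = 256, and the simple tree reduction are assumptions; this has not been tuned or benchmarked, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size; must be a power of two for the tree reduction below.
constexpr int kThreads = 256;

// Hypothetical kernel: block-level reduction in shared memory, followed by
// one atomicAdd per block on the global counter.
__global__ void countNonZeroBlockKernel(const int *data, int n, unsigned int *count)
{
    __shared__ unsigned int partial[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread contributes 1 if its element is non-zero, otherwise 0.
    partial[tid] = (i < n && data[i] != 0) ? 1u : 0u;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // One atomic per block instead of one per non-zero element.
    if (tid == 0 && partial[0] != 0) {
        atomicAdd(count, partial[0]);
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
unsigned int countNonZeroPerBlock(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;

    int *dData = nullptr;
    unsigned int *dCount = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));

    int blocks = (n + kThreads - 1) / kThreads;
    countNonZeroBlockKernel<<<blocks, kThreads>>>(dData, n, dCount);

    unsigned int result = 0;
    cudaMemcpy(&result, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dCount);
    return result;
}
```

With 256 threads per block, the number of atomics on the global counter drops from at most one per element to at most one per 256 elements.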

Do you have some examples of this, or some links to check? I am also interested in this idea of reduction routines.

A nice introductory example of reduction can be found in Mark Harris’s talk from SC2007:

http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

There is a lot of stuff at the beginning of the slides about CUDA optimization, but Example #2 in the talk is a description of parallel reduction and how to implement it for CUDA.

Indeed there is a lot of interesting stuff. Thank you very much.

Now I need to assimilate parallel reduction, but I guess I won’t finish today.

I recently tried something similar with some noticeable improvement, but it seemed to vary with my access patterns, which were quite unpredictable. Logically it made sense to reduce everything at the block level (from a bandwidth perspective) and then reduce using atomics at the global level… Whenever your access patterns begin to be data-dependent, you are on a slippery slope with regard to data locality :/