I was under the assumption that if many threads need to write to the same location in global memory, the scheduler would serialize the writes and protect the integrity of all of them. Something like this:
[codebox]
int tx = …;
int ty = …;
int i = threadIdx.x;
// thread i in block (tx,ty) does some calculations to obtain some kind of update value
int val = …;
gmem[tx][ty] += val;
[/codebox]
But perhaps my assumption is wrong. I thought the += would work, but apparently it doesn’t?
Okay, that makes sense. Is there an atomicAdd that supports double precision? If not, I’m screwed. Ugh, I just read that the floating-point version only exists on devices of compute capability 2.0 or higher. I’m at 1.3, so I’m sure that’s a no-go.
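One common workaround on pre-2.0 hardware is to emulate a double-precision atomicAdd with a 64-bit atomicCAS loop; this is the well-known technique from the CUDA C Programming Guide. A sketch (the function name atomicAddDouble is just my choice; the logic assumes 64-bit atomicCAS on global memory, which is available from compute capability 1.2, so it works on your 1.3 device):

```cuda
// Emulate atomicAdd(double*) with a compare-and-swap retry loop.
// Adapted from the CUDA C Programming Guide; works on sm_12 and up.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the double bits as 64-bit ints so atomicCAS can
        // swap them; retry if another thread updated the value meanwhile.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

Keep in mind that every thread in the block spinning on the same address makes this loop slow under heavy contention, which is why a shared-memory reduction is usually the better answer here.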
I might be extrapolating too much from that code fragment, but do you need to accumulate one double per block? If so, you can do a reduction in shared memory, then have thread 0 write out the sum at the end of the kernel.
I will explain. Each thread block uses x threads to do some calculations and apply an update to the same global memory location. A different thread block uses all of its x threads for a different global memory location than the first block. The problem arises when the x threads in a given block each try to += a value in global memory: some of the updates get stepped on and don’t write properly.
Yes, these are double values I’m updating with. If these were integers, I think I would be okay.
Yes, that’s my point. The threads which need to communicate (to create a sum of their values) are all in the same block. Threads that need to communicate in the same block => shared memory is your friend. :) A parallel reduction is a fast way to sum many values without requiring atomics, just a thread barrier. Since CUDA provides a very convenient and fast thread barrier for threads in the same block, parallel reduction in shared memory should be pretty quick.
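Here’s a rough sketch of what I mean. The kernel and array names (blockSumKernel, in) are placeholders, and I’m assuming one accumulator per block indexed by blockIdx.x rather than your (tx,ty) pair, plus a power-of-two block size:

```cuda
#define BLOCK_SIZE 256  // assumed block size; must be a power of two here

__global__ void blockSumKernel(const double *in, double *gmem)
{
    __shared__ double partial[BLOCK_SIZE];
    int i = threadIdx.x;

    // Illustrative stand-in for "some calculations":
    // each thread produces one update value.
    double val = in[blockIdx.x * blockDim.x + i];
    partial[i] = val;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (i < stride)
            partial[i] += partial[i + stride];
        __syncthreads();  // all threads must reach this barrier
    }

    // Only thread 0 touches global memory, so no atomics are needed.
    if (i == 0)
        gmem[blockIdx.x] += partial[0];
}
```

Note the __syncthreads() sits outside the `if`, so every thread in the block reaches the barrier each iteration; that’s required for correctness.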
This talk by Mark Harris explains the idea with pictures, although he is focused on a reduction in global memory, which means launching many kernels in a row:
Even simpler: if the sum at the end of your kernel is insignificant compared to the rest of the kernel runtime, you can make thread 0 do the shared-memory sum all by itself. No point making your code prematurely complicated for a 5% speed boost.
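That variant looks something like this (again, names and the per-block indexing are just placeholders for illustration):

```cuda
#define BLOCK_SIZE 256  // assumed block size

__global__ void blockSumSimple(const double *in, double *gmem)
{
    __shared__ double vals[BLOCK_SIZE];
    int i = threadIdx.x;

    // Each thread stages its update value in shared memory.
    vals[i] = in[blockIdx.x * blockDim.x + i];
    __syncthreads();

    // Thread 0 sums everything serially and does the single global write.
    if (i == 0) {
        double sum = 0.0;
        for (int j = 0; j < blockDim.x; ++j)
            sum += vals[j];
        gmem[blockIdx.x] += sum;
    }
}
```

The serial loop wastes parallelism, but if it’s a tiny fraction of the kernel’s work, the simplicity is worth it.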
The same problem occurs in my code. Could you explain a bit more about the assumption that atomicCAS makes about the warp scheduler’s behavior, and about the divergence?