Threads and Race Condition
Hello everyone.

I was under the assumption that if many threads need to write to the same location in global memory, the scheduler would serialize the writes and protect the integrity of all of them. For example:

[codebox]
int tx = ...;
int ty = ...;

int i = threadIdx.x;

// thread i in block (tx,ty) does some calculations to obtain some kind of update value
int val = ...;

gmem[tx][ty] += val;
[/codebox]

But perhaps my assumption is wrong. I thought the += would just work, but apparently it doesn't.

Thanks,
Daniel

#1
Posted 06/08/2010 03:05 PM   
That is exactly what atomic functions are for. Check out Appendix B.10 of the Programming Guide.
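
For illustration only (not from the original post): if the accumulator were a 32-bit int, the racy += could be replaced with a call to atomicAdd(), which the hardware serializes (integer atomicAdd on global memory is available from compute capability 1.1). The kernel name, flattened array layout, and placeholder value below are made up.

[code]// Sketch: atomicAdd() instead of an unprotected read-modify-write.
__global__ void update(int *gmem, int width)
{
    int tx = blockIdx.x;
    int ty = blockIdx.y;

    int val = 1;                              // stand-in for the per-thread update value

    // gmem[ty * width + tx] += val;          // racy: concurrent updates can be lost
    atomicAdd(&gmem[ty * width + tx], val);   // serialized by the hardware
}[/code]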

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 06/08/2010 03:15 PM   
[quote name='tera' post='1070191' date='Jun 8 2010, 10:15 AM']That is exactly what atomic functions are for. Check out Appendix B.10 of the Programming Guide.[/quote]

Okay, that makes sense. Is there an atomicAdd method that supports double precision? If not, I'm screwed. Ugh, I just read that the floating-point one only exists for devices of compute capability 2.0 or higher. I'm at 1.3, so I'm sure that's a no-go.

#3
Posted 06/08/2010 03:22 PM   
[quote name='zenosparadox' post='1070195' date='Jun 8 2010, 09:22 AM']Okay, that makes sense. Is there an atomicAdd method that supports double precision? If not, I'm screwed. Ugh, I just read that the floating-point one only exists for devices of compute capability 2.0 or higher. I'm at 1.3, so I'm sure that's a no-go.[/quote]

I might be extrapolating too much from that code fragment, but do you need to accumulate just one double per block? If so, you can do a reduction in shared memory, then have thread 0 write out the sum at the end of the kernel.

#4
Posted 06/08/2010 03:31 PM   
[quote name='seibert' post='1070200' date='Jun 8 2010, 10:31 AM']I might be extrapolating too much from that code fragment, but do you need to accumulate just one double per block? If so, you can do a reduction in shared memory, then have thread 0 write out the sum at the end of the kernel.[/quote]

I will explain. Each thread block uses x threads to do some calculations and produce an update for a single GMEM location. A different thread block uses all of its x threads for a different GMEM location than the first block. So the problem arises when the x threads within a given block each try to += a value in GMEM: apparently some of the updates get stepped on and are not written properly.

Yes, these are double values I'm updating with. If these were integers, I think I would be okay.

#5
Posted 06/08/2010 03:44 PM   
Seibert has a point here. However, if you still need an atomicAdd for double values, try this:

[code]__device__ double atomicAdd(double *address, double value)
{
long long oldval, newval, readback;

oldval = __double_as_longlong(*address);
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
while ((readback=atomicCAS((long long *)address, oldval, newval)) != oldval) {
oldval = readback;
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
}

return __longlong_as_double(oldval);
}[/code]
It works in global memory, but not in shared. Fermi's float atomicAdd() would of course be a lot faster.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#6
Posted 06/08/2010 03:45 PM   
[quote name='tera' post='1070208' date='Jun 8 2010, 10:45 AM']Seibert has a point here. However, if you still need an atomicAdd for double values, try this:

[code]__device__ double atomicAdd(double *address, double value)
{
    // atomicCAS has no double overload, so reinterpret the bits as an
    // unsigned 64-bit integer (the type atomicCAS actually accepts).
    unsigned long long oldval, newval, readback;

    oldval = __double_as_longlong(*address);
    newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    // Retry until no other thread has changed the value between our read and our CAS.
    while ((readback = atomicCAS((unsigned long long *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    }

    return __longlong_as_double(oldval);
}[/code]
It works in global memory, but not in shared. Fermi's float atomicAdd() would of course be a lot faster.[/quote]

Thanks, tera. I will take a look at this.

#7
Posted 06/08/2010 03:57 PM   
[quote name='zenosparadox' post='1070206' date='Jun 8 2010, 09:44 AM']I will explain. Each thread block uses x threads to do some calculations and produce an update for a single GMEM location. A different thread block uses all of its x threads for a different GMEM location than the first block. So the problem arises when the x threads within a given block each try to += a value in GMEM: apparently some of the updates get stepped on and are not written properly.[/quote]

Yes, that's my point. The threads which need to communicate (to create a sum of their values) are all in the same block. Threads that need to communicate in the same block => shared memory is your friend. :) A parallel reduction is a fast way to sum many values without requiring atomics, just a thread barrier. Since CUDA provides a very convenient and fast thread barrier for threads in the same block, parallel reduction in shared memory should be pretty quick.

This talk by Mark Harris explains the idea with pictures, although he is focused on a reduction in global memory, which means launching many kernels in a row:

[url="http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf"]http://developer.download.nvidia.com/compu...c/reduction.pdf[/url]

#8
Posted 06/08/2010 04:10 PM   
[quote][code]__device__ double atomicAdd(double *address, double value)
{
    // atomicCAS has no double overload, so reinterpret the bits as an
    // unsigned 64-bit integer (the type atomicCAS actually accepts).
    unsigned long long oldval, newval, readback;

    oldval = __double_as_longlong(*address);
    newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    // Retry until no other thread has changed the value between our read and our CAS.
    while ((readback = atomicCAS((unsigned long long *)address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    }

    return __longlong_as_double(oldval);
}[/code][/quote]

There is a good chance that the atomicCAS loop will just spin, causing an infinite loop. It assumes certain behaviour of the warp scheduler with respect to divergence...
I have seen such deadlocks before...

Ignorance Rules; Knowledge Liberates!

#9
Posted 06/09/2010 01:22 PM   
I can only see that it could livelock; I don't see how it could cause a deadlock.

Do you have a better solution?

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#10
Posted 06/09/2010 01:30 PM   
[quote name='seibert' post='1070219' date='Jun 8 2010, 10:10 AM']Yes, that's my point. The threads which need to communicate (to create a sum of their values) are all in the same block. Threads that need to communicate in the same block => shared memory is your friend. :) A parallel reduction is a fast way to sum many values without requiring atomics, just a thread barrier. Since CUDA provides a very convenient and fast thread barrier for threads in the same block, parallel reduction in shared memory should be pretty quick.

This talk by Mark Harris explains the idea with pictures, although he is focused on a reduction in global memory, which means launching many kernels in a row:

[url="http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf"]http://developer.download.nvidia.com/compu...c/reduction.pdf[/url][/quote]

Even simpler: if the sum at the end of your kernel is insignificant compared to the rest of the kernel runtime, you can make thread 0 do the shared memory sum all by itself. No point in making your code prematurely complicated for a 5% speed boost.
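
For reference, a minimal sketch of that simpler variant (hypothetical names again): every thread stores its contribution in shared memory, one barrier, then thread 0 sums serially and writes the result.

[code]__global__ void blockSumSimple(const double *in, double *blockTotals)
{
    extern __shared__ double sdata[];

    sdata[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                          // make all contributions visible

    if (threadIdx.x == 0) {
        double sum = 0.0;
        for (unsigned int j = 0; j < blockDim.x; ++j)   // serial sum by a single thread
            sum += sdata[j];
        blockTotals[blockIdx.x] = sum;        // single write, no race
    }
}[/code]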

#11
Posted 06/09/2010 03:17 PM   
[quote name='Sarnath' date='09 June 2010 - 08:22 AM' timestamp='1276089779' post='1070731']
There is a good chance that the atomicCAS loop will just spin, causing an infinite loop. It assumes certain behaviour of the warp scheduler with respect to divergence...
I have seen such deadlocks before...
[/quote]
The same problem also occurs in my code. Could you explain a bit more about the assumptions that atomicCAS makes about the warp scheduler's behaviour and divergence?

#12
Posted 04/30/2012 02:13 AM   