I don’t use shared memory; error is a float* in device memory. I have almost the same code in another __global__ function and it works perfectly, and the address of error doesn’t change.
The problem in your code is that it creates a race condition, so it’s incorrect and the results are UB (undefined behavior). You need to use atomic operations to avoid that.
Ok, thank you for your reply.
I understand the race condition problem, but I have a question about it.
Why do I always have the problem when I do this?
__global__ void calculErrorPerceptronForOutputLayer(float* error)
{
    int n = threadIdx.x;
    if (n == 0)
    {
        for (int i = 0; i < 7000000; i++) // busy loop, roughly a 3 second delay
        {
            int a = 32;
        }
    }
    *error += 1.1f;
    printf("neurons(%i,%i) : %f\n", 1, 0, *error);
}
Also, I don’t use __syncthreads();
I decided to rework my code instead of using atomicAdd() because:
- atomicAdd takes more time than a simple addition
- atomicAdd on double is unavailable on the compute_50 architecture (it requires compute_60 or later)
- it’s dumb to use multi-threading just to execute atomicAdd calls one after another.
Matthieu, the issue with your second block of code is that it changes nothing. First off, the compiler will likely optimize the inner for loop away entirely.
Secondly, introducing an arbitrary delay doesn’t change the fact that you still have a data race.
You don’t need __syncthreads() in this case because it gains you nothing here.
There is a reason atomicAdd takes more time than a simple addition: an atomic operation is serialized to preserve data integrity, whereas a plain addition does nothing to protect it. That cost is inherent to avoiding a data race.
Why doesn’t the thread with n = 1 compute the error and store the result through the pointer while the thread with n = 0 is sleeping, so that afterwards the thread with n = 0 computes the error using the new value in the pointer?