Strange result from a simple addition inside a thread

Hello everyone!
Sorry for my English, I'm French.

I've been stuck on this error for several days.

// my host code
calculError<<<1, 2>>>();

// my device code
__global__ void calculError()
{
    int n = threadIdx.x;
    
    for(int w = 0; w < MyVar; w++)
    {
        if(n == 0)
            errors[w] += 1.1f;
        else
            errors[w] += 1.1f;
    }
}

This first test code gives me 2.2 for all my errors, which is normal. BUT this:

// my host code
calculError<<<1, 2>>>();

// my device code
__global__ void calculError()
{
    int n = threadIdx.x;
    
    for(int w = 0; w < MyVar; w++)
    {
        errors[w] += 1.1f;
    }
}

This gives me only 1.1 for all my errors!

How is that possible???

Can someone explain this? Please, help me.

You don't give nearly enough information (what is MyVar?), but you probably want to look into using atomicAdd() instead of +=.
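
For example, a minimal sketch, assuming errors and MyVar are declared as in your code:

// sketch only: the same kernel with atomicAdd
__global__ void calculError()
{
    for(int w = 0; w < MyVar; w++)
        atomicAdd(&errors[w], 1.1f); // the read-modify-write happens as one indivisible operation
}

With <<<1, 2>>>, both increments always land, so every slot ends up at 2.2.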

MyVar is:

const int MyVar = 25;

I don't know what information you want.

I don't use shared memory, errors is a float* in device memory, I have almost the same code in another __global__ function and it works perfectly, and the errors address doesn't change.

Really, I don't understand this problem.

The problem in your code is that it creates a race condition, so it's incorrect and the results are UB (undefined behavior). You need to use atomic operations to avoid that.
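
To make it concrete: errors[w] += 1.1f is not a single operation. Schematically (illustrative pseudocode, not the actual generated code), each thread does:

float tmp = errors[w]; // thread 0 and thread 1 can both read the same old value
tmp = tmp + 1.1f;      // each one adds to its own private copy
errors[w] = tmp;       // the later write overwrites the earlier one

If both threads read before either one writes, one increment is lost, which is exactly why you see 1.1 instead of 2.2.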

OK, thank you for your reply.
I understand the race condition problem, but I have a question about it.

Why do I still get the problem when I do this?

__global__ void calculErrorPerceptronForOutputLayer(float* errorsInHiddenLayers)
{
    int n = threadIdx.x;
    if(n == 0)
    {
        for(int i = 0; i < 7000000; i++) // meant as roughly a 3 second sleep
        {
            int a = 32; // dead store: the loop body has no side effects
        }
    }
    *errorsInHiddenLayers += 1.1f;
    printf("neurons(%i,%i) : %f\n", 1, 0, *errorsInHiddenLayers);
}

Also, I don’t use __syncthreads();

I've decided to rework my code instead of using atomicAdd() because:

- atomicAdd takes more time than a simple addition
- atomicAdd is unavailable on the compute_50 architecture
- it's dumb to use multi-threading just to execute atomicAdd calls one after another

Matthieu, the issue with your second block of code is that it changes nothing. First off, the compiler will likely optimize out the inner for loop, since its body has no side effects.

Secondly, introducing an arbitrary delay doesn't change the fact that you will have data race issues.

You don't need __syncthreads() in this case because it gives you nothing here: it makes the threads of a block wait at a barrier, it doesn't make += atomic.

There is a reason atomicAdd takes more time than a simple addition: an atomic serializes the read-modify-write to preserve data integrity, whereas a plain addition doesn't take that into account. That cost is inherent to fixing a data race.
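
For what it's worth, the usual compromise is to keep the bulk of the additions parallel and pay the atomic cost only once per thread. A rough sketch; the kernel name and the count parameter are made up for illustration, not taken from your code:

__global__ void accumulateError(float* total, int count)
{
    float local = 0.0f;          // private per-thread register: no sharing, no race
    for(int i = 0; i < count; i++)
        local += 1.1f;           // fully parallel accumulation
    atomicAdd(total, local);     // one atomic per thread instead of one per iteration
}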

Why doesn't the thread with n = 1 compute the error and write the result through the pointer while the thread with n = 0 is sleeping, and then why doesn't the thread with n = 0 compute the error using the new value behind the pointer?

I think we need to start clean.

What are you trying to accomplish?

Thank you, but it works well now. I'm working on a neural network.
I changed my code so that each thread computes a different error.
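
Roughly, the idea is this (a simplified sketch, not my exact code):

__global__ void calculError(float* errors)
{
    int n = threadIdx.x;
    errors[n] += 1.1f; // thread n only ever touches errors[n], so there is nothing to race on
}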