Race conditions clarification!

Hello,

I am having a hard time recognizing race conditions, although I am familiar with the definition.
A race happens when multiple threads access the same memory location and at least one of them writes to it. Because the threads run in parallel, we don't know which one reads or writes first/last.

I have an example where I am comparing CUDA, C, and OpenMP:

CUDA:

int index = blockIdx.x * blockDim.x + threadIdx.x;

if (index < N)
    c[index] += a[index];

C:

for (i = 0; i < N; i++)
{
    c[i] += a[i];
}

OpenMP:

#pragma omp parallel for shared(a, c) private(i)
for (i = 0; i < N; i++)
{
    c[i] += a[i];
}

All of the above give the same results, so we don't have race conditions.

But shouldn't we, since we are writing to c (hence, the same memory location)?

Also, if in the above codes I replace:

c[i] += a[i]   (or c[index] += a[index])

with:

r += a[i]   (or r += a[index])

then the C and OpenMP implementations give the same results, but CUDA gives a different one.
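
To make that second case concrete, here is roughly what I mean by the CUDA version (the kernel name sum_into_scalar is just mine for illustration, and r is assumed to point to a single int in device memory):

__global__ void sum_into_scalar(const int *a, int *r, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Every thread performs an unsynchronized read-modify-write on the single
    // location *r, so updates from different threads can overwrite each other.
    if (index < N)
        *r += a[index];
}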

Shouldn’t we have race conditions here also?

Please, can someone explain this to me? It is driving me crazy!!
Any pointers on this race condition thing would be appreciated!

Thank you!

c[index] is not “the same location”. The location varies by thread. You may need to review basic C array concepts.
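
For example, with a 1-D launch (block size 4 here purely for illustration), each thread computes a different index and therefore touches a different element of c:

__global__ void add_per_element(const int *a, int *c, int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // With blockDim.x = 4 and blockIdx.x = 0:
    //   thread 0 -> c[0] += a[0]
    //   thread 1 -> c[1] += a[1]
    //   thread 2 -> c[2] += a[2]
    //   thread 3 -> c[3] += a[3]
    // No two threads read or write the same element, so there is no race.
    if (index < N)
        c[index] += a[index];
}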

I think this is an easier example.
Adding 1 thirty-two times should give 32, right?

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void simpleadd(int *sum)
{
//  atomicInc((unsigned int *)sum, (unsigned int)10000);  // race-free alternative
    *sum = *sum + 1;    // unsynchronized read-modify-write on the shared counter
}

int main(void)
{

    int *sum;
    dim3 BlockPerGrid(1, 1, 1);
    dim3 ThreadPerBlock(8, 4, 1);   // 8 * 4 = 32 threads

    cudaMallocManaged(&sum, sizeof(int));

    *sum = 0;

    simpleadd <<< BlockPerGrid, ThreadPerBlock >>> ((int *) sum);
    cudaDeviceSynchronize();    // We must WAIT until the end!!

    printf("Sum=%d\n", *sum);

    exit(0);

}

$ nvcc -gencode=arch=compute_52,code=compute_52 -O2 race.cu -o race
$ ./race
Sum=1
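
If you uncomment the atomicInc line above (or use the equivalent atomicAdd, sketched below), the read-modify-write is serialized, no increment is lost, and the program prints Sum=32:

__global__ void simpleadd(int *sum)
{
    // atomicAdd performs the read-modify-write as one indivisible operation,
    // so the 32 threads cannot overwrite each other's updates.
    atomicAdd(sum, 1);
}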

OK, since c[0] is a different location from c[1] and so on, we don't have a race condition, because we are writing to different memory locations.

But there isn't a problem when we read c[0], then add a[0] to it and write it back into c[0], because we only read from c[0] at that point, right?

(c[0] = c[0] + a[0])

OK, and in the sum example above we do have a race condition, since all the threads are writing to the same memory location.

So, looking at my first post, in the second example where I use the variable “r” (not an array as before), we are going to have a race condition, right?

But then why do the C and OpenMP codes give the same result, while CUDA does not?

Thank you!

You should compare hardware with hardware,
and compare software with software.
What do you think?
CUDA vs. SMP: which one is closer to the truth, and which is a dream machine?

I am sorry, but I don't understand how this answers my question about my second example and the results.