Race Conditions Deviations on cumulative sum calculation
Hi

We are "trying" to implement an Algorithm on GPU, but we are suffering from some race conditions.

The kernel below is a computation step, which is called within a loop from the host (see below).

The calculations performed here are:
[code]
r = y - Ax + b/m*r
tau = alpha*1/sqrt(m)*norm(r,2);
[/code]
where r,y,Ax are vectors and m,alpha,b are scalars

Due to some "optimizations", we do perform the calculations only on certains indecies of r, which are specified in the bool array d_mask.

But with this implementation we get some race conditions, we do not know where they come from.
On my Laptop (HP 8440p with NVidia NVS 3100M - Computecapabilty 1.2) the cumulative sum (*d_tau_add_tmp) is not always properly calculated. Most of the time it is correct but sometimes (1:10) the value is a little bit off (around +-20%)
We tested the same code on a Lenovo w520 (Nvidia m2000), but there the race conditions do not occur.

We are aware that the code could be substantially improved in terms of performance, but we would like to understand the reason behind the race conditions first.

We are looking forward for your support!

Kernel:
[code]
__global__ void GPUampP1(const float* d_y, float* d_r, float* d_r2, float* d_Ax, float* d_b, float* d_tau, const float alpha, const int n, const bool* d_mask, const int m, float* d_tau_add_tmp)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;

if(idx < n)
{
// r = y - Ax + b/m*r
if(d_mask[idx])
{
d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;
} else
{
d_r[idx] = 0;
}
}
__syncthreads();

//tau = alpha*1/sqrt(m)*norm(r,2);
if(idx < n)
{
if(d_mask[idx])
d_r2[idx] = d_r[idx] * d_r[idx];
}
__syncthreads();

// Racecondition seems to be in this part:
if(idx < 1);
{
*d_tau_add_tmp = 0;
for(int i = 0;i < n; i++)
if(d_mask[i])
*d_tau_add_tmp += d_r2[i]; //<<-- the sum d_tau_add_tmp is not always correct (just little deviation)

float f = *d_tau_add_tmp;
*d_tau = sqrt(f/m) * alpha;
}
__syncthreads();
}
[/code]

Host:
[code]
for(int i = 0; i < len; i++)
{
// some other kernel calls

GPUampP1<<<blocksPerGrid, threadsPerBlock>>>(d_y,d_r,d_r2, d_Ax, d_b, d_tau, alpha, n, d_mask, m, d_tau_add_tmp);
cudaDeviceSynchronize();
getLastCudaError("kernel launch failure");
// some other kernel calls
}
[/code]
Hi



We are "trying" to implement an Algorithm on GPU, but we are suffering from some race conditions.



The kernel below is a computation step, which is called within a loop from the host (see below).



The calculations performed here are:



r = y - Ax + b/m*r

tau = alpha*1/sqrt(m)*norm(r,2);


where r,y,Ax are vectors and m,alpha,b are scalars



Due to some "optimizations", we do perform the calculations only on certains indecies of r, which are specified in the bool array d_mask.



But with this implementation we get some race conditions, we do not know where they come from.

On my Laptop (HP 8440p with NVidia NVS 3100M - Computecapabilty 1.2) the cumulative sum (*d_tau_add_tmp) is not always properly calculated. Most of the time it is correct but sometimes (1:10) the value is a little bit off (around +-20%)

We tested the same code on a Lenovo w520 (Nvidia m2000), but there the race conditions do not occur.



We are aware that the code could be substantially improved in terms of performance, but we would like to understand the reason behind the race conditions first.



We are looking forward for your support!



Kernel:



__global__ void GPUampP1(const float* d_y, float* d_r, float* d_r2, float* d_Ax, float* d_b, float* d_tau, const float alpha, const int n, const bool* d_mask, const int m, float* d_tau_add_tmp)

{

int idx = blockDim.x * blockIdx.x + threadIdx.x;



if(idx < n)

{

// r = y - Ax + b/m*r

if(d_mask[idx])

{

d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;

} else

{

d_r[idx] = 0;

}

}

__syncthreads();



//tau = alpha*1/sqrt(m)*norm(r,2);

if(idx < n)

{

if(d_mask[idx])

d_r2[idx] = d_r[idx] * d_r[idx];

}

__syncthreads();



// Racecondition seems to be in this part:

if(idx < 1);

{

*d_tau_add_tmp = 0;

for(int i = 0;i < n; i++)

if(d_mask[i])

*d_tau_add_tmp += d_r2[i]; //<<-- the sum d_tau_add_tmp is not always correct (just little deviation)



float f = *d_tau_add_tmp;

*d_tau = sqrt(f/m) * alpha;

}

__syncthreads();

}




Host:



for(int i = 0; i < len; i++)

{

// some other kernel calls



GPUampP1<<<blocksPerGrid, threadsPerBlock>>>(d_y,d_r,d_r2, d_Ax, d_b, d_tau, alpha, n, d_mask, m, d_tau_add_tmp);

cudaDeviceSynchronize();

getLastCudaError("kernel launch failure");

// some other kernel calls

}

#1
Posted 04/07/2012 02:40 PM   
Hello,

Your code will work as long as you use only 1 block:

The race condition comes from the fact that you can not synchronize threads in different blocks. The problem is

[code]


//tau = alpha*1/sqrt(m)*norm(r,2);
if(idx < n)
{
if(d_mask[idx])
d_r2[idx] = d_r[idx] * d_r[idx];
}
__syncthreads();
[/code]

At this point threads in different blocks are supposed to upate d_r2[idx], and then proceed to the add.

You have 2 possibilities:

1) you put in a different kernel this part :

[code]
{
*d_tau_add_tmp = 0;
for(int i = 0;i < n; i++)
if(d_mask[i])
*d_tau_add_tmp += d_r2[i]; //<<-- the sum d_tau_add_tmp is not always correct (just little deviation)

float f = *d_tau_add_tmp;
*d_tau = sqrt(f/m) * alpha;
}
[/code]

2) the other possibility is in the programming guide at the threadfence() function there is an example in which you can make the last part to be executed only after the d_r2[idx] = d_r[idx] * d_r[idx]; becomes visible to all threads.
Hello,



Your code will work as long as you use only 1 block:



The race condition comes from the fact that you can not synchronize threads in different blocks. The problem is









//tau = alpha*1/sqrt(m)*norm(r,2);

if(idx < n)

{

if(d_mask[idx])

d_r2[idx] = d_r[idx] * d_r[idx];

}

__syncthreads();




At this point threads in different blocks are supposed to upate d_r2[idx], and then proceed to the add.



You have 2 possibilities:



1) you put in a different kernel this part :





{

*d_tau_add_tmp = 0;

for(int i = 0;i < n; i++)

if(d_mask[i])

*d_tau_add_tmp += d_r2[i]; //<<-- the sum d_tau_add_tmp is not always correct (just little deviation)



float f = *d_tau_add_tmp;

*d_tau = sqrt(f/m) * alpha;

}




2) the other possibility is in the programming guide at the threadfence() function there is an example in which you can make the last part to be executed only after the d_r2[idx] = d_r[idx] * d_r[idx]; becomes visible to all threads.

#2
Posted 04/07/2012 05:21 PM   
Thank you very much for your support!

That solved part of the problem.

Perviously it did not even work with only one block.

Note the semiokolon...
[code]if(idx < 1);[/code]
... beginners mistake ...

Problem now solved, thank you...
Thank you very much for your support!



That solved part of the problem.



Perviously it did not even work with only one block.



Note the semiokolon...

if(idx < 1);


... beginners mistake ...



Problem now solved, thank you...

#3
Posted 04/07/2012 06:32 PM   
You should also considering fusing the first two ifs:

[code]
if(idx < n)
{
// r = y - Ax + b/m*r
if(d_mask[idx])
{
d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;
} else
{
d_r[idx] = 0;
}
}
__syncthreads();

//tau = alpha*1/sqrt(m)*norm(r,2);
if(idx < n)
{
if(d_mask[idx])
d_r2[idx] = d_r[idx] * d_r[idx];
}
__syncthreads();

[/code]

[code]
if(idx < n)
{
// r = y - Ax + b/m*r
if(d_mask[idx])
{
d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;
d_r2[idx] = d_r[idx] * d_r[idx];
} else
{
d_r[idx] = 0;
}
}
__syncthreads();

[/code]

Even better will be to compute d_r2 for all idx and eliminate the extra if in the reduction loop.
You should also considering fusing the first two ifs:





if(idx < n)

{

// r = y - Ax + b/m*r

if(d_mask[idx])

{

d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;

} else

{

d_r[idx] = 0;

}

}

__syncthreads();



//tau = alpha*1/sqrt(m)*norm(r,2);

if(idx < n)

{

if(d_mask[idx])

d_r2[idx] = d_r[idx] * d_r[idx];

}

__syncthreads();








if(idx < n)

{

// r = y - Ax + b/m*r

if(d_mask[idx])

{

d_r[idx] = d_y[idx] - d_Ax[idx] + *d_b / m * d_r[idx] ;

d_r2[idx] = d_r[idx] * d_r[idx];

} else

{

d_r[idx] = 0;

}

}

__syncthreads();






Even better will be to compute d_r2 for all idx and eliminate the extra if in the reduction loop.

#4
Posted 04/07/2012 07:58 PM   
Just keep in mind that when you use this kind of contruction:

[code]
d_r2[idx] = d_r[idx] * d_r[idx];
}
__syncthreads();


if(idx < 1);
{
[/code]

it will only guaranties the job is done only in the block which contains the thread with idx=0. The other blocks might still be not executed. In other words this part
[code]
if(idx < 1);
{
[/code] will be executed before all blocks finished updating the d_r2 array. The semicolon was a typo, but this code will only give the right results with 1 block. In order to get the right result you need to have the line *d_tau_add_tmp += d_r2[i]; executed by the last block which is the last ran.
Just keep in mind that when you use this kind of contruction:





d_r2[idx] = d_r[idx] * d_r[idx];

}

__syncthreads();





if(idx < 1);

{




it will only guaranties the job is done only in the block which contains the thread with idx=0. The other blocks might still be not executed. In other words this part



if(idx < 1);

{
will be executed before all blocks finished updating the d_r2 array. The semicolon was a typo, but this code will only give the right results with 1 block. In order to get the right result you need to have the line *d_tau_add_tmp += d_r2[i]; executed by the last block which is the last ran.

#5
Posted 04/08/2012 11:37 AM   
Hi,
just a few hints:
- don't use global memory if not necessary, ie do all your computation and store the final result in global memory
and don't pass parameters through global mem
- in your case, except for reduction, the is no need of thread communication in the same block, so no need to syncthread
- you can avoid the else part by setting d_r array to 0 before calling your kernel
- for the sum the best way is to do a modified parallel sum on d_r in a separate kernel.
The parallel sum kernel is memory bound, so passing d_r to this kernel and adding a multiplication inside to get
r*r will not slow down the reduction. In this way you don't need to store d_r2 and save a lot of time in your GPUampP1 kernel.
You can also do a block wide parallel sum in your compute kernel and after finish the reduction with a separate kernel acting on the n/BLOCK_SIZE partial sums

The result will be something like that:
Device:
[code]

__global__ void GPUampP1(const float* d_y, float* d_r, float* d_Ax, const float b, float* d_tau, const float alpha, const int n, const bool* d_mask)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;

if(idx < n)
{
// r = y - Ax + b/m*r
if(d_mask[idx])
{
d_r[idx] = d_y[idx] - d_Ax[idx] + b / m * d_r[idx] ;
}
}
}
[/code]

Host:
[code]
for(int i = 0; i < len; i++)
{
// some other kernel calls
cudaMemset on d_r to zero it
GPUampP1<<<blocksPerGrid, threadsPerBlock>>>(d_y,d_r, d_Ax, b, n, d_mask);
getLastCudaError("kernel launch failure");
// Parallel sum on d_r computing ∑r^2 => f
tau = sqrt(f/m) * alpha;
// some other kernel calls
}[/code]
Hi,

just a few hints:

- don't use global memory if not necessary, ie do all your computation and store the final result in global memory

and don't pass parameters through global mem

- in your case, except for reduction, the is no need of thread communication in the same block, so no need to syncthread

- you can avoid the else part by setting d_r array to 0 before calling your kernel

- for the sum the best way is to do a modified parallel sum on d_r in a separate kernel.

The parallel sum kernel is memory bound, so passing d_r to this kernel and adding a multiplication inside to get

r*r will not slow down the reduction. In this way you don't need to store d_r2 and save a lot of time in your GPUampP1 kernel.

You can also do a block wide parallel sum in your compute kernel and after finish the reduction with a separate kernel acting on the n/BLOCK_SIZE partial sums



The result will be something like that:

Device:





__global__ void GPUampP1(const float* d_y, float* d_r, float* d_Ax, const float b, float* d_tau, const float alpha, const int n, const bool* d_mask)

{

int idx = blockDim.x * blockIdx.x + threadIdx.x;



if(idx < n)

{

// r = y - Ax + b/m*r

if(d_mask[idx])

{

d_r[idx] = d_y[idx] - d_Ax[idx] + b / m * d_r[idx] ;

}

}

}




Host:



for(int i = 0; i < len; i++)

{

// some other kernel calls

cudaMemset on d_r to zero it

GPUampP1<<<blocksPerGrid, threadsPerBlock>>>(d_y,d_r, d_Ax, b, n, d_mask);

getLastCudaError("kernel launch failure");

// Parallel sum on d_r computing ∑r^2 => f

tau = sqrt(f/m) * alpha;

// some other kernel calls

}

#6
Posted 04/09/2012 12:48 AM   
First of all, thanks a lot for your hints!

Nevertheless I do have some general questions:

[quote]- don't use global memory if not necessary, ie do all your computation and store the final result in global memory
and don't pass parameters through global mem[/quote]

-> I have fixed parameter, which do never change (such as alpha, n, m) -> so you propose to pass them as parameter (as I did)
But I have also some coefficients, which are recalculated on the GPU once every iteration (such as b and tau) -> Is there the use of global memory reasonable?


-> What is better for performance? using threadfence or calling a seperate kernel?

-> What about "caching" the access to global memory in shared memory? e.g. Copy first with a thread per block all needed global data d_y,d_Ax,etc to shared memory. Would this improve the performance or is the caching already done implicitly?

Thank you for your answers!
First of all, thanks a lot for your hints!



Nevertheless I do have some general questions:



- don't use global memory if not necessary, ie do all your computation and store the final result in global memory

and don't pass parameters through global mem




-> I have fixed parameter, which do never change (such as alpha, n, m) -> so you propose to pass them as parameter (as I did)

But I have also some coefficients, which are recalculated on the GPU once every iteration (such as b and tau) -> Is there the use of global memory reasonable?





-> What is better for performance? using threadfence or calling a seperate kernel?



-> What about "caching" the access to global memory in shared memory? e.g. Copy first with a thread per block all needed global data d_y,d_Ax,etc to shared memory. Would this improve the performance or is the caching already done implicitly?



Thank you for your answers!

#7
Posted 04/09/2012 09:51 AM   
[quote]
-> I have fixed parameter, which do never change (such as alpha, n, m) -> so you propose to pass them as parameter (as I did)
[/quote]
Yes

[quote]
But I have also some coefficients, which are recalculated on the GPU once every iteration (such as b and tau) -> Is there the use of global memory reasonable?
[/quote]
Yes, but the important thing is how you compute those parameters on GPU

[quote]
-> What is better for performance? using threadfence or calling a seperate kernel?
[/quote]
If you are doing a reduction (max, sum, ....) threadfence will not be sufficient. You need to implement (or use) a reduction kernel.

[quote]
-> What about "caching" the access to global memory in shared memory? e.g. Copy first with a thread per block all needed global data d_y,d_Ax,etc to shared memory. Would this improve the performance or is the caching already done implicitly?
[/quote]
Caching in shared memory is useful when you need to share data across threads of the same block. This is not your case and your access pattern is fully coalesced.


-> I have fixed parameter, which do never change (such as alpha, n, m) -> so you propose to pass them as parameter (as I did)



Yes





But I have also some coefficients, which are recalculated on the GPU once every iteration (such as b and tau) -> Is there the use of global memory reasonable?



Yes, but the important thing is how you compute those parameters on GPU





-> What is better for performance? using threadfence or calling a seperate kernel?



If you are doing a reduction (max, sum, ....) threadfence will not be sufficient. You need to implement (or use) a reduction kernel.





-> What about "caching" the access to global memory in shared memory? e.g. Copy first with a thread per block all needed global data d_y,d_Ax,etc to shared memory. Would this improve the performance or is the caching already done implicitly?



Caching in shared memory is useful when you need to share data across threads of the same block. This is not your case and your access pattern is fully coalesced.

#8
Posted 04/15/2012 12:06 PM   
Scroll To Top