Incorrect synchronization inside a "while" loop (occurring only in Release mode)

Greetings.

I have a kernel with a “while” loop that iteratively updates elements of an array using information from neighbors
(only one neighbor in the sample code below). The loop stops when no element changes during the current iteration.

Unfortunately, in some situations some threads exit this loop prematurely (as if they ignored the synchronization barrier).
Some inputs are processed correctly every time, while many other inputs are processed incorrectly every time
(i.e. there is no stochastic factor). Strangely, this error occurs only in the Release build, while the Debug build has
always worked fine. More precisely, the CUDA compiler option “-G (Generate GPU Debug Information)” determines whether the
processing is correct. Arrays of size 32x32 or smaller are always processed correctly.

Here is a sample code:

__global__ void kernel(int *source, int size, unsigned char *result, unsigned char *alpha)
	{
		int x = threadIdx.x, y0 = threadIdx.y * 4;
		int i;
		__shared__ bool alpha_changed;

		// Zero intermediate array using margins for safe access to neighbors
		const int stride = MAX_SIZE + 2;
		for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
		{
			alpha[i] = 0;
		}
		__syncthreads();

		for (int bit = MAX_BITS - 1; bit >= 0; bit--)
		{
			__syncthreads();
	
			// Fill intermediate array with bit values from input array
			alpha_changed = true;
			alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
			alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
			alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
			alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
			__syncthreads();

			// The loop in question
			while (alpha_changed)
			{
				alpha_changed = false;
				__syncthreads();
				if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
				{
					alpha_changed = true;
					alpha[(x + 1) + (y0 + 1) * stride] = 1;
				}
				__syncthreads();
				if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
				{
					alpha_changed = true;
					alpha[(x + 1) + (y0 + 2) * stride] = 1;
				}
				__syncthreads();
				if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0)
				{
					alpha_changed = true;
					alpha[(x + 1) + (y0 + 3) * stride] = 1;
				}
				__syncthreads();
				if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0)
				{
					alpha_changed = true;
					alpha[(x + 1) + (y0 + 4) * stride] = 1;
				}
				__syncthreads();
			}
			__syncthreads();
			
			// Save result
			result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
			result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
			result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
			result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
			__syncthreads();
		}
	}

	// Run only 1 thread block, where size equals 64.
	kernel <<< 1, dim3(size, size / 4) >>> (source_gpu, size, result_gpu, alpha_gpu);

The expected result of this sample kernel is an array in which each line contains only contiguous intervals
of “1” values. Instead, I get some lines where “0” and “1” somehow alternate.

This error is reproduced on my mobile GPU GeForce 740M (Kepler), on Windows 7 x64 SP1, on either CUDA 6.0 or 6.5,
using either Visual C++ 2012 or 2013. I can also provide a sample Visual Studio project with a sample input array (i.e. one that is processed incorrectly).

I have already tried different configurations of __syncthreads(), fences, and the “volatile” qualifier, but the error
remained.

Any help is appreciated.

my preliminary take is that you have a race between reading/writing of alpha
running racecheck on the release build may confirm (or refute) this

such a race would explain why the debug build works but the release build does not, and why it works for block sizes of 32 threads/1 warp, but not for multiple warps

to get something like the snippet below to work with a release build,

if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 1) * stride] = 1;
}

i personally would do something like:

int lint1, lint2; // local variables

while (alpha_changed)
{
lint1 = 0;
lint2 = 0;

if (threadIdx.x == 0)
{
alpha_changed = false;
}

if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
{
lint1 = 1;
}

if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
{
lint2 = 1;
}

__syncthreads();

if (lint1 > 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 1) * stride] = 1;
}

if (lint2 > 0)
{
alpha_changed = true;
alpha[(x + 1) + (y0 + 2) * stride] = 1;
}

__syncthreads();
}

you now have clear barriers between reading and writing of alpha at all times, which can be honoured properly by multiple warps
i have squashed the number of __syncthreads() by using multiple local variables for evaluation, but i suppose one would suffice

Thank you for the quick response and for pointing me to the Racecheck tool (new to me). Racecheck found 13 hazards of different types.

I like your idea of clear barriers, but in this form it didn’t solve the problem, and I still don’t understand why. Unexpectedly, Racecheck found 12 hazards for the kernel corrected in this way (and took about 15 minutes, much longer than for the original kernel in question).

Fortunately, the answer of Roeland (see the Stack Overflow question “Incorrect synchronization inside a "while" loop (occurring only in Release mode)”) solved it. Although Racecheck found even more hazards (26), they do not prevent the kernel from giving the correct result (moreover, it also fixed my real kernel, which is much more complex).

I do not think it is wise to ignore race conditions reported by cuda-memcheck’s racecheck tool. Just because correct results happened to be achieved on a particular run with particular data on a particular GPU, this doesn’t mean the race conditions found will not cause incorrect results to be produced in other circumstances.

“Racecheck found 12 hazards for kernel corrected in this way (and took about 15 minutes - much longer than with the original kernel in question).”

it is only ‘natural’ for test beds like memcheck and racecheck and others to take longer
the purpose of such tools is validation, not production/throughput
it’s like taking the pilot’s vitals before sending him out to fly the stealth jet
and as njuffa points out, ‘signing off’ on code with known races means having a pilot with poor vitals fly the jet

in hindsight, the code i posted still has a race; you need a __syncthreads() between reading and writing alpha_changed

while (alpha_changed)
{
lint1 = 0;
lint2 = 0;

if (threadIdx.x == 0)
{
alpha_changed = false;
}

like so:

while (alpha_changed)
{
lint1 = 0;
lint2 = 0;

__syncthreads();

if (threadIdx.x == 0)
{
alpha_changed = false;
}

otherwise, it’s another read/write race

With the third __syncthreads() you suggested, the kernel now gives the correct result. Thank you.

But even for this version racecheck still finds 12 hazards. Following njuffa’s advice, I tried to clarify these hazards: I ran racecheck on the Debug executable (in order to see line numbers for the hazards) and got only 9 hazards, but the kernel output became incorrect. These 9 hazards indicate that __syncthreads() calls are being ignored again. E.g. there is a race between reading (“while (alpha_changed)”) and writing (“alpha_changed = false” and “alpha_changed = true”), in spite of the __syncthreads() calls between them. So should I really trust this racecheck report? How did these __syncthreads() calls become invalid during the racecheck run?

__syncthreads() does not act as a general code-motion barrier. So it seems possible to have a different number of races for debug versus release builds due to code motion. I think it is also possible that the racecheck tool cannot find all the races all the time. But at the least one should fix the races it can find and does report.

I have never seen the tool report false positives. So if it reports a race, it stands to reason that it does exist. Some of the races I have seen reported that were a bit difficult to understand occurred on a backward loop-closing branch, i.e. the race was between an operation at the end of the loop body in the current iteration and an operation at the start of the loop body in the next iteration.

Why are synchronization barriers crossed in the debug version? I found a message by Mark Harris (see the Stack Overflow question “What is the difference when compiling a CUDA program with or without option -g -G”), where he said that with the -G option (present in Debug mode) such optimizations (with code motion) are probably (?) disabled.

Anyway, if __syncthreads() does not help (I already have it at the beginning of the loop and at the end), what other means are there to fix such races?

I understood your previous comment to mean that there are fewer races with the debug version, not more, so I don’t see a contradiction to what I stated. Yes, -G pretty much disables all compiler optimizations. Even in release builds, the compiler will not move shared memory accesses across a call to __syncthreads(), at least that is my understanding.

I would suggest studying the code along with the report from the racecheck tool. It is imperative to understand why those races exist, instead of simply adding additional __syncthreads(). __syncthreads() is a reliable barrier, but it needs to be used correctly. For example, make sure you never have a __syncthreads() call in a divergent code flow, or else all bets are off (undefined behavior). From a cursory look, the while() loop itself seems to represent a divergent code flow, as it is data driven and not all threads are guaranteed to execute this loop (different data for different threads).

“Even in release builds, the compiler will not move shared memory accesses across a call to __syncthreads(), at least that is my understanding.”

i would concur; i would also like to think that __syncthreads() as a barrier is in ways similar to the memory barriers of (multi-threaded) host code: not necessarily guaranteeing the order of memory commits, but guaranteeing that/when memory accesses are committed

perhaps it is equally important to understand why the particular tool complains at the time, or sees something as an anomaly, so that one can also apply one’s own judgment
some warnings one might ignore, if one feels the tool is over-prudent and the underlying code is not in danger

there is still one race the tool might complain about:
multiple threads may set the rerun flag (alpha_changed) depending on their own outcomes, i.e. multiple threads attempting to write to the same address at the same time
generally, this constitutes a race; i might not view it that way, as you are essentially writing the same value, and thus rather care about the flag being set than about the actual value
the prudent solution may be to use atomics, or a warp vote, to set the rerun flag (alpha_changed)
personally, i feel this is overkill
you could easily test this: see if the remaining races all point to the same line (very possible), and see whether an atomic increment of alpha_changed resolves this race

Thank you all for useful remarks.

Here is the final code, which works correctly and has only WAW hazards (writing “true” vs. writing another “true”).
I agree that resolving them is overkill in my case.

__global__ void kernel_Test(int *source, int size, unsigned char *result, unsigned char *alpha)
{
	int x = threadIdx.x, y0 = threadIdx.y * 4, i, lint1, lint2, lint3, lint4;
	__shared__ bool alpha_changed;

	// Zero intermediate array using margins for safe access to neighbors
	const int stride = MAX_SIZE + 2;
	for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
	{
		alpha[i] = 0;
	}
	__syncthreads();

	for (int bit = MAX_BITS - 1; bit >= 0; bit--)
	{
		__syncthreads();
		
		// Fill intermediate array with bit values from input array
		if (threadIdx.x + threadIdx.y == 0) alpha_changed = true;
		alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
		alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
		alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
		alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
		__syncthreads();

		// The loop in question
		while (alpha_changed)
		{
			__syncthreads();
			lint1 = 0; lint2 = 0; lint3 = 0; lint4 = 0;
			if (threadIdx.x + threadIdx.y == 0) alpha_changed = false;
			if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0) lint1 = 1;
			if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0) lint2 = 1;
			if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0) lint3 = 1;
			if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0) lint4 = 1;
			__syncthreads();

			if (lint1)
			{
				alpha_changed = true;
				alpha[(x + 1) + (y0 + 1) * stride] = 1;
			}
			if (lint2)
			{
				alpha_changed = true;
				alpha[(x + 1) + (y0 + 2) * stride] = 1;
			}
			if (lint3)
			{
				alpha_changed = true;
				alpha[(x + 1) + (y0 + 3) * stride] = 1;
			}
			if (lint4)
			{
				alpha_changed = true;
				alpha[(x + 1) + (y0 + 4) * stride] = 1;
			}
			__syncthreads();
		}

		// Save result
		result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
		result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
		result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
		result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
		__syncthreads();
	}
}