Possible synchronization bug on Maxwell?

I am implementing some sort of work distribution mechanism.
I have a nested work list, like {{w1, w2}, {w3, w4, w5, w6}, {w7}, {w8, w9, w10}}
The work distribution mechanism goes as follows:

  1. Store the amount of work in each sub-list into an array,
  2. Compute the prefix-sum of the array,
  3. Each thread uses binary search to find its work.

The whole code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK_SIZE  32

__device__ int test(int pSum) {
    volatile __shared__ int prefixsum[BLOCK_SIZE];
    int count = 0;

    prefixsum[threadIdx.x] = pSum;

    // Loop while any thread in the block still has work remaining
    while(__syncthreads_or(pSum > 0)) {
        // Binary search: find the first prefixsum entry greater than threadIdx.x
        int min = 0;
        int max = BLOCK_SIZE-1;
        while(min < max) {
            int mid = (min+max)/2;

            if(prefixsum[mid] <= threadIdx.x)
                min = mid+1;
            else
                max = mid;
        }
        // All threads must finish searching prefixsum before it is updated below
        __syncthreads();

        // Short-circuited condition: target entry is >= BLOCK_SIZE and all earlier entries are zero
        if(prefixsum[min]>=BLOCK_SIZE && (min==0||prefixsum[min-1]==0)) {
            count++;

            if(pSum > 0)
                pSum -= BLOCK_SIZE;

            prefixsum[threadIdx.x] = pSum;
            __syncthreads();
        }
        else {
            if(pSum >= BLOCK_SIZE)
                pSum -= BLOCK_SIZE;
            else
                pSum = 0;

            prefixsum[threadIdx.x] = pSum;
            __syncthreads();
        }
    }

    return count;
}

__device__ int D_Prefixsum[BLOCK_SIZE];
__device__ int D_Counts[BLOCK_SIZE];

__global__ void gpuMain() {
    D_Counts[threadIdx.x] = test(D_Prefixsum[threadIdx.x]);
}

void checkErr(cudaError_t err) {
    if(err != cudaSuccess) {
        printf("cuda error: %s\n", cudaGetErrorString(err));
        exit(0);
    }
}

int H_Prefixsum[BLOCK_SIZE];
int H_Counts[BLOCK_SIZE];

int main(int argc, char *argv[]) {
    srand(time(NULL));

    int sum = 0;
    for(int i=0; i<BLOCK_SIZE; i++) {
        sum += (rand()%(BLOCK_SIZE*5));
        H_Prefixsum[i] = sum;
    }

    checkErr(cudaMemcpyToSymbol(D_Prefixsum, H_Prefixsum, sizeof(int)*BLOCK_SIZE));
    cudaDeviceSynchronize();

    gpuMain<<<1, BLOCK_SIZE>>>();
    cudaDeviceSynchronize();

    checkErr(cudaMemcpyFromSymbol(H_Counts, D_Counts, sizeof(int)*BLOCK_SIZE));
    cudaDeviceSynchronize();

    for(int i=0; i<BLOCK_SIZE; i++) {
        printf("%d ", H_Counts[i ]);
    }
    printf("\n");

    return 0;
}

Each thread will call test(), and pSum is the corresponding prefix-sum entry.
The binary search is the inner while(min < max) loop; the search result is in the variable min, which I will call the target entry.
After the binary search, each thread updates its prefix-sum entry (to remove the finished work).

The problem is that the control flow should not diverge at the if(prefixsum[min]>=BLOCK_SIZE && ...) statement,
because if the first nonzero entry of prefixsum is large enough, then all threads must have the same target entry.
So count should be the same among all threads.
However, when I run the code on a GTX750Ti, I get different counts among threads.
The CUDA version is 6.5

I also tried the code on a GTX770, and it runs as expected (the same count among all threads).
Moreover, after removing the __syncthreads() calls inside the if and else blocks, I get the correct result on the GTX750Ti.

So is this a bug, or am I misusing something?

This post won’t actually address your question, but I’m curious: can you organize your work by buckets? I think I know of an algorithmic improvement for building the array and accessing each bucket’s contents (it doesn’t need a binary search).

@mnnuahg in general you should not rely on the synchronization/divergence behavior of a warp.

It is highly dependent on compiler optimizations and can actually get quite complex for programs with seemingly simple control flow. If you really want to see how your code is getting synchronized you need to look at the SASS, and even then you should not depend on it because a change to an unrelated part of the program (or a new compiler version) may cause the compiler to optimize your code differently, generating different synchronization behavior.

This topic causes quite a lot of confusion because, for many programs, it looks like there is only one possible synchronization point, and the compiler and the programmer agree on it. But that is only true for some programs. It is false in general, and probably false for your program.

if(prefixsum[min]>=BLOCK_SIZE && (min==0||prefixsum[min-1]==0))

This type of control flow statement is called a short-circuited branch, and is one (of many) types of control flow that create the possibility for multiple synchronization points. In these cases the compiler must pick one, and this may be a problem if it picks a different one than the programmer.
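
To make the short-circuit point concrete, here is an illustrative hand-expansion of that conditional into the nested branches its evaluation implies (shortCircuitExpanded is a made-up name, and this is not the compiler’s actual output, just the logical equivalent):

__device__ int shortCircuitExpanded(volatile int *prefixsum, int min) {
    int taken = 0;
    if(prefixsum[min] >= BLOCK_SIZE) {            // first branch
        if(min == 0) {                            // second branch
            taken = 1;
        }
        else if(prefixsum[min-1] == 0) {          // third branch
            taken = 1;
        }
    }
    // Each nesting level is a separate branch, and each branch end is a
    // candidate reconvergence point; the compiler has to pick among them.
    return taken;
}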

It isn’t too hard to show that C/C++ programs can have multiple possible choices for synchronization points, that it is impossible to pick ‘the best one’ statically (at compile time) without a solution to the halting problem, that it is also impossible to pick ‘the best one’ dynamically (at run time) without a perfect oracle predictor of each thread’s control flow, and that even if you had the oracle, computing the solution would be intractable (it is NP-complete). Finally, even if the compiler/hardware could determine the ‘best point’ and always choose it (rather than choosing an arbitrary point), it would be equally difficult for the programmer to determine where warp synchronization would occur, and use this information to make assumptions about whether a warp is diverged or not.

So it is always better to write code that is correct regardless of whether or not a warp is diverged.

I am curious: Why have you placed the __syncthreads(); inside the if and else-blocks instead of after the conditional code?

@MutantJohn
Do you mean equal-sized buckets? In my situation this may be hard.
If the buckets can have unequal sizes, then I’m very interested in whether there is a good balancing algorithm.

@Gregory Diamos
Thanks for your reply. So you mean putting __syncthreads() inside a conditional block is a bad idea, even if the conditional evaluates identically for all threads?
If this is true, then I think NVIDIA should put it in the programming guide. But in the programming guide, I only see: “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block”
And as for the undecidability and intractability of choosing synchronization points, could you give me some references? I’m interested in this topic.

@tera
Actually the __syncthreads() calls inside the if/else blocks are unnecessary; they are just there to trigger the “bug”.

The __syncthreads() calls inside the if/else blocks are not just unnecessary, they are wrong. This is what Gregory tried to point out.

This is a bit of a gray area in the programming guide unfortunately.

@tera
So you mean the sentence “__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block” in the programming guide is wrong, or only correct in some cases?

And is there any workaround?
For example, if I modify the if statement

if(prefixsum[min]>=BLOCK_SIZE && (min==0||prefixsum[min-1]==0))

into

bool pred = prefixsum[min]>=BLOCK_SIZE && (min==0||prefixsum[min-1]==0);
__syncthreads();
if(pred) {
    ....
}

will this solve the problem?

That’s not quite what I mean. There are a number of corner cases here, so it is hard to phrase the requirement exactly right, and the programming guide makes a decent attempt at it. If you read the programming guide as meaning that if the conditional evaluates identically for all threads, then there will be no divergence, and therefore no ambiguity about where the warp will be converged, then it is correct. My statement was about what happens if there is divergence.

As far as I know, there is no single good reference. There are a number of publications and patents from the 80s that describe the theory behind it. I’ve been considering writing a document that describes some of these issues and includes proofs of some of the statements above, but I probably won’t have time for it for a while. The intuition behind the intractability result is that you can reduce the problem of finding the best reconvergence point to multi-sequence-alignment among the paths that threads in a warp will take. The intuition behind the undecidability result is that i) the best point will be different if threads take different paths through the program, and ii) the hardware/compiler cannot know those paths a-priori.

There is a subset of all programs for which the best reconvergence point is statically determinable at compile time. However, it is often hard to determine if your program falls into this category because the rules are complex with several corner cases, and they need to be applied after the compiler optimizes your program (so you need to be aware of the ways that compiler optimizations can change your program). So the programming guide is probably right to just assert that programs should not use __syncthreads inside any divergent control flow block.
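
As a concrete (untested) sketch of what that means for the kernel in the original post, the loop body can be restructured so that neither branch contains a barrier and the shared-memory update plus __syncthreads() are reached on a common path by every thread; test_safe and bigBucket are names made up here:

__device__ int test_safe(int pSum) {
    volatile __shared__ int prefixsum[BLOCK_SIZE];
    int count = 0;

    prefixsum[threadIdx.x] = pSum;

    while(__syncthreads_or(pSum > 0)) {
        // Binary search: first prefixsum entry greater than threadIdx.x
        int min = 0;
        int max = BLOCK_SIZE-1;
        while(min < max) {
            int mid = (min+max)/2;
            if(prefixsum[mid] <= threadIdx.x)
                min = mid+1;
            else
                max = mid;
        }
        __syncthreads();

        // Evaluate the condition first; no barrier inside either branch
        bool bigBucket = prefixsum[min] >= BLOCK_SIZE &&
                         (min == 0 || prefixsum[min-1] == 0);
        // All reads of prefixsum finish before any thread overwrites its entry
        __syncthreads();

        if(bigBucket) {
            count++;
            if(pSum > 0)
                pSum -= BLOCK_SIZE;
        }
        else {
            if(pSum >= BLOCK_SIZE)
                pSum -= BLOCK_SIZE;
            else
                pSum = 0;
        }

        // The shared-memory update and the barrier are now reached by every
        // thread on the same path, whether or not bigBucket diverged
        prefixsum[threadIdx.x] = pSum;
        __syncthreads();
    }

    return count;
}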

There is. All they do is use an array to store the starting location of each bucket.

If you have a contiguous block of bucket contents (sorted by bucket ID), you can easily look up where each bucket starts (using the bucket ID, so it’d be bucket_starts[bucket_id] = where bucket #whatever starts in bucket_contents) and use that to find the length of each bucket, i.e. the difference in starting locations between adjacent buckets.

The challenge is just building the bucket_starts array.
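
A minimal host-side sketch of that lookup, using the work list from the original post (bucket_contents and bucket_starts follow the description above; the extra sentinel entry at the end of bucket_starts is my own convenience, not something from the paper):

#include <stdio.h>

#define NUM_BUCKETS 4
#define NUM_ITEMS   10

int main(void) {
    /* {{w1,w2}, {w3,w4,w5,w6}, {w7}, {w8,w9,w10}} flattened, sorted by bucket ID */
    int bucket_contents[NUM_ITEMS] = {1,2, 3,4,5,6, 7, 8,9,10};
    /* bucket_starts[b] = index in bucket_contents where bucket b begins;
       the sentinel entry at the end makes the length computation uniform */
    int bucket_starts[NUM_BUCKETS+1] = {0, 2, 6, 7, 10};

    for(int b = 0; b < NUM_BUCKETS; b++) {
        int begin  = bucket_starts[b];
        int length = bucket_starts[b+1] - bucket_starts[b];
        printf("bucket %d: first item %d, length %d\n", b, bucket_contents[begin], length);
    }
    return 0;
}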

Here’s a paper on it: http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf

Edit: It’s in the Chaining section, I think. Sorry, the paper covers some irrelevant stuff. But it’s a chained linked-list implementation that they bring to the GPU.

Hi all, sorry for the late reply

@Gregory Diamos
I found you have many publications related to this topic.
I read this one: “SIMD Re-Convergence At Thread Frontiers”, and it states that an irregular control flow graph may cause problems with barriers, because threads are not guaranteed to re-converge at the basic block containing the barrier.
So I think I understand what the difficulty is, but I still hope that NVIDIA can solve the problem, either by providing exact restrictions on using barriers inside conditional blocks, or by applying a compiler transformation to conditional blocks containing barriers.
Thanks for your reply anyway, and I’m looking forward to the document about the hardness of choosing synchronization points.

@MutantJohn
Thanks for providing this material, it is helpful.