I am implementing some sort of work distribution mechanism.
I have a nested work list, like {{w1, w2}, {w3, w4, w5, w6}, {w7}, {w8, w9, w10}}
The work distribution mechanism goes as follows:
- Store the work amount in each sub-list into an array,
2, Compute the prefix-sum of the array - Each thread use binary search to find its work
The whole code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define BLOCK_SIZE 32
__device__ int test(int pSum) {
volatile __shared__ int prefixsum[BLOCK_SIZE];
int count = 0;
prefixsum[threadIdx.x] = pSum;
while(__syncthreads_or(pSum > 0)) {
int min = 0;
int max = BLOCK_SIZE-1;
while(min < max) {
int mid = (min+max)/2;
if(prefixsum[mid] <= threadIdx.x)
min = mid+1;
else
max = mid;
}
__syncthreads();
if(prefixsum[min]>=BLOCK_SIZE && (min==0||prefixsum[min-1]==0)) {
count++;
if(pSum > 0)
pSum -= BLOCK_SIZE;
prefixsum[threadIdx.x] = pSum;
__syncthreads();
}
else {
if(pSum >= BLOCK_SIZE)
pSum -= BLOCK_SIZE;
else
pSum = 0;
prefixsum[threadIdx.x] = pSum;
__syncthreads();
}
}
return count;
}
__device__ int D_Prefixsum[BLOCK_SIZE];
__device__ int D_Counts[BLOCK_SIZE];
__global__ void gpuMain() {
D_Counts[threadIdx.x] = test(D_Prefixsum[threadIdx.x]);
}
void checkErr(cudaError_t err) {
if(err != cudaSuccess) {
printf("cuda error: %s\n", cudaGetErrorString(err));
exit(0);
}
}
int H_Prefixsum[BLOCK_SIZE];
int H_Counts[BLOCK_SIZE];
int main(int argc, char *argv[]) {
srand(time(NULL));
int sum = 0;
for(int i=0; i<BLOCK_SIZE; i++) {
sum += (rand()%(BLOCK_SIZE*5));
H_Prefixsum[i ] = sum;
}
checkErr(cudaMemcpyToSymbol(D_Prefixsum, H_Prefixsum, sizeof(int)*BLOCK_SIZE));
cudaDeviceSynchronize();
gpuMain<<<1, BLOCK_SIZE>>>();
cudaDeviceSynchronize();
checkErr(cudaMemcpyFromSymbol(H_Counts, D_Counts, sizeof(int)*BLOCK_SIZE));
cudaDeviceSynchronize();
for(int i=0; i<BLOCK_SIZE; i++) {
printf("%d ", H_Counts[i ]);
}
printf("\n");
return 0;
}
Each thread will call test(), and pSum is the corresponding prefix-sum entry.
The binary search is at line 14~24, the search result is in variable min, which I will call the target entry
After the binary search, each thread will update its prefix-sum entry (to delete the finished work)
The problem is, at line 26 the control flow should not diverge
Because if the first nonzero entry of prefixsum is large enough, then all thread must have the same target entry.
So count should be the same among all threads.
However, when I try the code on a GTX750Ti, I got different count among threads
The CUDA version is 6.5
I also tried on a GTX770, and the code runs as expected (the same count among threads)
Moreover, after removing the __syncthreads() at line 33 and 42, I got correct result on GTX750Ti
So is this a bug or I misuse something?