CUDA double buffer (producer/consumer)

Hi all!

I have a situation where I receive data from a sensor and want to process it on the GPU. Since launching a kernel repeatedly has a lot of overhead, I instead want to keep the kernel running and feed it data in packages as I receive them. For this I have implemented a double-buffering system, where the host fills a buffer, signals the kernel that the buffer is ready to be consumed, and then begins filling the second buffer while the GPU works. The flags between the device and the host are just pinned (mapped) memory which the device can access via DMA. I decide which buffer to use by incrementing a counter each cycle and checking whether the counter is odd or even. If the buffer is not ready (on the device side), a while loop keeps the thread busy while it waits. When the host decides the job is done, it sets a flag to tell the threads to terminate.
However, the code I have now gives me weird output. I’ve set up a trivial example of what I want to do below. Essentially, all the threads really do for “work” is put the current value of their iteration counter (the one that decides which buffer to use) into the output array. Since my host terminates the threads after filling the buffers 10 times, I would expect the output array to contain only 9’s at termination. However, some array cells do, while others contain 8’s or even 11’s… it’s very strange. I’m running all this on Ubuntu 16.04 using CUDA 9.0 on a GeForce GTX 1070 with Max-Q. So here is my code and the output I’m getting:

#include <stdio.h>

#define DATA_BUFFER_SIZE 32
#define EVENT_DIMENSION 4
#define NUM_TRANSFORMS 60

__device__ volatile int blockcounter1=0;
__device__ volatile int blockcounter2=0;

__global__ void doubleBuffer2D(int* buffer1, int* buffer2, volatile int* bufferstate1, volatile int* bufferstate2, int* outputList, volatile int* terminateFlag) {
	int transformID = blockIdx.y;
	int eventID = blockIdx.x*blockDim.x+threadIdx.x;
	int outIdx = transformID*DATA_BUFFER_SIZE+eventID;

	if(eventID<DATA_BUFFER_SIZE) {
		int iteration=0;
		while(*terminateFlag!=1) {
			volatile int * state = (iteration%2==0)? bufferstate1:bufferstate2;
			int * buffer = (iteration%2==0)? buffer1:buffer2;
			volatile int * blockcounter = (iteration%2==0)? &blockcounter1:&blockcounter2;

			while(*state==0);//Wait until the host releases the buffer
			//Do something ======
			outputList[outIdx]=iteration;
			//Finish doing something =====

			__syncthreads(); // wait for my block to finish
			if (!threadIdx.x) atomicAdd((int *)blockcounter, 1); //Mark my block as finished
			__threadfence(); //make sure everyone can see my atomicAdd before proceeding
			if(transformID==0 && eventID==0) { // I'm the master thread!
				while(*blockcounter<NUM_TRANSFORMS); //Wait for everyone to be done.
				*blockcounter=0;
				*state=0;//Release the buffer back to the host
			}
			iteration++;
		}
	}

}

int main() {
	printf("Run \n");

	int * data = (int*)malloc(DATA_BUFFER_SIZE*sizeof(int));

	int * d_outputList;
	int * d_buffer1;
	int * d_buffer2;
	volatile int * outputBuffer1_state; //when the state=0, the host is claiming the buffer, if state=1 the device is claiming the buffer, if state=-1 the buffer is released
	volatile int * outputBuffer2_state;
	volatile int * terminate_thread;
	int h_outputList[NUM_TRANSFORMS * DATA_BUFFER_SIZE * EVENT_DIMENSION];

	cudaMalloc(&d_outputList, sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE);
	cudaMalloc(&d_buffer1, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaMalloc(&d_buffer2, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaHostAlloc(&outputBuffer1_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&outputBuffer2_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&terminate_thread, sizeof(int), cudaHostAllocMapped);

	cudaStream_t streamk, streamc;
	cudaStreamCreate(&streamk);
	cudaStreamCreate(&streamc);

	*outputBuffer1_state = 0;
	*outputBuffer2_state = 0;
	*terminate_thread = -1;

	int blockwidth = DATA_BUFFER_SIZE;
	int blocks = DATA_BUFFER_SIZE;
	int gridWidth = (blocks+blockwidth-1)/blockwidth;
	dim3 gridDim(gridWidth, NUM_TRANSFORMS);
	printf("Launching kernel with grid=%d,%d, block=%d \n", gridWidth, NUM_TRANSFORMS, blocks);
	doubleBuffer2D<<<gridDim, blocks, 0, streamk>>>(d_buffer1, d_buffer2, outputBuffer1_state, outputBuffer2_state, d_outputList, terminate_thread);

	for (int i = 0; i < 10; i++) {
		printf("========== Iteration %d ==========\n", i);
		int * output_buffer = (i % 2 == 0) ? d_buffer1 : d_buffer2;
		volatile int * buffer_state = (i % 2 == 0) ? outputBuffer1_state : outputBuffer2_state;
		if(i%2==0)printf("Buffer=buffer1, state=%d \n", *buffer_state);
		if(i%2==1)printf("Buffer=buffer2, state=%d \n", *buffer_state);
		printf(" Waiting for release...");
		while (*buffer_state == 1);// printf(" Waiting for release..."); //wait for the device to release the buffer
		printf("\n Released! Buffer state = %d",*buffer_state);
		printf("\n Copying memory...");
		cudaMemcpyAsync(output_buffer, &data[0], sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION, cudaMemcpyHostToDevice, streamc);
		printf("\n Waiting for copy...");
		cudaStreamSynchronize(streamc); //Wait for the copy to be done
		*buffer_state = 1; //Release the buffer to the device
		printf("\n Releasing Buffer...\n Buffer state = %d \n",*buffer_state);
		printf("val i=%d \n",i);
	}
	*terminate_thread = 1; //release the device
	cudaStreamSynchronize(streamk);
	cudaMemcpy(&h_outputList[0], d_outputList, sizeof(int) * NUM_TRANSFORMS* DATA_BUFFER_SIZE, cudaMemcpyDeviceToHost);
	cudaDeviceSynchronize();

	for(int i=0; i<NUM_TRANSFORMS; i++) {
		for(int j=0; j<DATA_BUFFER_SIZE; j++) {
			int idx = i*DATA_BUFFER_SIZE+j;
			printf("%d, ",h_outputList[idx]);
		}
		printf("\n");
	}
	printf("\n");
	printf("Done \n");
	return 0;
}

output:

Run 
Launching kernel with grid=1,60, block=32 
========== Iteration 0 ==========
Buffer=buffer1, state=0 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=0 
========== Iteration 1 ==========
Buffer=buffer2, state=0 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=1 
========== Iteration 2 ==========
Buffer=buffer1, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=2 
========== Iteration 3 ==========
Buffer=buffer2, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=3 
========== Iteration 4 ==========
Buffer=buffer1, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=4 
========== Iteration 5 ==========
Buffer=buffer2, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=5 
========== Iteration 6 ==========
Buffer=buffer1, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=6 
========== Iteration 7 ==========
Buffer=buffer2, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=7 
========== Iteration 8 ==========
Buffer=buffer1, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=8 
========== Iteration 9 ==========
Buffer=buffer2, state=1 
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1 
val i=9 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 

Done

Basically that large matrix at the bottom should contain only 9’s…well if anyone has spotted the mistake I’d be thrilled to find out what I’ve done wrong. Many thanks!
Timo

I think you may have a sense that all threads and blocks in the grid are executing in lockstep, or in a fashion that is relatively synchronized to each other. But I’m not sure your coding pattern actually enforces this.

Let’s take an example.

Thread zero in block (0,0) is responsible for resetting your blockcounter(s). Suppose this thread (or this block) happens to get low priority in scheduling.

The host starts out by filling the first buffer, and setting the state to 1.
It then immediately proceeds to filling the second buffer, and setting its state to 1. The host will then revert to the first buffer, and wait until it observes the state set back to zero.

The kernel blocks begin to process buffer 0. For whatever reason, block (0,0) takes a long time to execute. All the other blocks can work on buffer 0, incrementing the blockcounter, and then proceed to buffer 1. Since the host has already set the state for buffer 1 to 1, all the blocks can work on buffer 1. They do their work on buffer 1, and then revert to buffer 0. But block (0,0) is still not winning the scheduling game, so even though the blockcounter is high enough, the code to reset it hasn’t been executed yet. And the state variable is still 1. So all the other blocks that have raced ahead still see the state as 1, and can proceed with processing buffer 0 (again).

But this clearly isn’t what you intended. 60 blocks on a GTX 1070 are not all going to execute in lockstep. A proper parallel algorithm on a GPU acknowledges that blocks can execute in any order, at any rate, and makes sure that such race conditions are managed by the programmer, explicitly.

I’m sure there are multiple ways to fix this, but one approach available to CUDA programmers in CUDA 9 is the grid-wide sync. What you essentially want with this particular pattern is a grid-wide sync: no block is allowed to proceed with iteration 1 (or if you prefer, iteration 2) until all blocks have finished iteration 0, and there has been proper handshaking with the host on iteration 0. The grid-wide sync in CUDA 9 is part of the CG (cooperative groups) mechanism; you can read about it in the programming guide, and there are various CG code samples that demonstrate grid sync, such as the reductionMultiBlockCG CUDA sample code.
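
For orientation, the basic shape of that pattern looks something like this (a minimal sketch only, not tailored to your kernel; the kernel name and parameters are illustrative, and the kernel must be launched with cudaLaunchCooperativeKernel rather than the <<<...>>> syntax):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistentKernel(volatile int* terminateFlag /*, ... */) {
    cg::grid_group grid = cg::this_grid();
    while (*terminateFlag != 1) {
        // ... consume the current buffer ...
        grid.sync(); // no block proceeds past this point until every block in the grid arrives
        // ... only now is it safe for any block to move on to the next buffer ...
    }
}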

I believe the above is a possible explanation for output data that exceeds 9. The other case is output data that is less than 9, i.e. 8. In my case, I only observe this anomaly on the first block, which also happens to be the block responsible for advancing the buffer states. In your data, you witness it on multiple blocks.

I haven’t thought about this carefully, but my guess is that you have a race condition between the device setting the buffer state to 0 and the host subsequently setting the termination flag. If other blocks have raced ahead (as described above), I think it may be possible for block (0,0) to complete iteration 8, and, since the host has also raced ahead, for the host to set the termination flag, which block (0,0) observes before actually working on iteration 9. The end result is that only the first block’s data remains at 8 in the output. An extension of this concept could result in other blocks also not catching up to iteration 9 before the termination flag is set. Once the first mechanism described above is in play, the forward racing of some blocks allows this second mechanism (early termination) to come into play, possibly for an arbitrary set of blocks.

Here’s a modification of your code that seems to give better results for me. I believe there are still hazards in this code, because there is an implicit requirement that all threadblocks be both resident and able to make forward progress, so that, for example, this condition can eventually be satisfied:

while(*blockcounter<NUM_TRANSFORMS); //Wait for everyone to be done.

If all threadblocks are not resident on SMs (perhaps because you are running with 60 threadblocks on a “smaller” GPU), then this condition will result in a hang. But we can still learn something by tackling the other two races/hazards I mentioned.

To avoid the first race, I’ve built a slightly different interlock on the state variable. The state variable starts at zero, and then gets set by the host to the iteration number, to indicate “buffer ready”. The device code is working synchronously with the host, so it knows what iteration number to expect, and it will wait for that number before proceeding. When finished, it sets the state variable back to zero. Therefore if a block “races ahead” it won’t see the “go ahead” state of 1, it will see the previous iteration number, and it will wait. This eliminates (AFAICT) the possibility for numbers higher than 9 in the output.
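
Distilled down, the interlock looks like this (a sketch of the idea only; the full listing below is authoritative, including the exact correspondence between the host’s i and the device’s iteration):

// device side, each iteration (the master thread resets the state once all blocks have checked in):
while (*state != iteration);   // wait until the host publishes this iteration number
// ... consume the buffer ...
*state = 0;                    // hand the buffer back to the host

// host side, each iteration i:
while (*buffer_state != 0);    // wait until the device hands the buffer back
// ... fill the buffer ...
*buffer_state = i;             // publish iteration i: "buffer ready"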

The second condition can be trivially worked around by imposing enough of a delay from when the last buffer “go ahead” signal is given, to when the terminate flag is set. However this exposes another race. Some blocks will, after the 9th iteration, go on to expect a 10th iteration (they have not seen the terminate flag yet). Of course these blocks will now wait for the next buffer “go ahead” signal, but they will never get it, and hang here:

while(*state==0);//Wait until the host releases the buffer

To work around that, we provide an “early” exit possibility:

while(*state!=iteration) if (*terminateFlag == 1) return;//Wait until the host releases the buffer

Here is a fully worked example, on Linux, CUDA 9, and a Tesla P100. As I mentioned, it is not guaranteed to work correctly on all GPUs (and, like any other code I post, it may also have other bugs that I simply have not witnessed). The proper design paradigm to work around this is to ensure that only as many blocks as the carrying capacity of the GPU are launched, which is what the aforementioned CG cooperative grid methods do.

$ cat t19.cu
#include <stdio.h>
#include <unistd.h>

#define DATA_BUFFER_SIZE 32
#define EVENT_DIMENSION 4
#define NUM_TRANSFORMS 60

__device__ volatile int blockcounter1=0;
__device__ volatile int blockcounter2=0;

__global__ void doubleBuffer2D(int* buffer1, int* buffer2, volatile int* bufferstate1, volatile int* bufferstate2, int* outputList, volatile int* terminateFlag) {
 int transformID = blockIdx.y;
 int eventID = blockIdx.x*blockDim.x+threadIdx.x;
 int outIdx = transformID*DATA_BUFFER_SIZE+eventID;

 if(eventID<DATA_BUFFER_SIZE) {
  int iteration=1;
  while(*terminateFlag!=1) {
   volatile int * state = (iteration%2==0)? bufferstate1:bufferstate2;
   int * buffer = (iteration%2==0)? buffer1:buffer2;
   volatile int * blockcounter = (iteration%2==0)? &blockcounter1:&blockcounter2;

   while(*state!=iteration) if (*terminateFlag == 1) return;//Wait until the host releases the buffer
   //Do something ======
   outputList[outIdx]=iteration;
   //Finish doing something =====

   __syncthreads(); // wait for my block to finish
   if (!threadIdx.x) atomicAdd((int *)blockcounter, 1); //Mark my block as finished
   __threadfence(); //make sure everyone can see my atomicAdd before proceeding
   if(transformID==0 && eventID==0) { // I'm the master thread!
    while(*blockcounter<NUM_TRANSFORMS); //Wait for everyone to be done.
    *blockcounter=0;
    *state=0;//Release the buffer back to the host
   }
   iteration++;
  }
 }

}

int main() {
 printf("Run \n");

 int * data = (int*)malloc(DATA_BUFFER_SIZE*sizeof(int));

 int * d_outputList;
 int * d_buffer1;
 int * d_buffer2;
 volatile int * outputBuffer1_state; //when the state=0, the host is claiming the buffer, if state=1 the device is claiming the buffer, if state=-1 the buffer is released
 volatile int * outputBuffer2_state;
 volatile int * terminate_thread;
 int h_outputList[NUM_TRANSFORMS * DATA_BUFFER_SIZE * EVENT_DIMENSION];

 cudaMalloc(&d_outputList, sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE);
 cudaMalloc(&d_buffer1, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
 cudaMalloc(&d_buffer2, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
 cudaHostAlloc(&outputBuffer1_state, sizeof(int), cudaHostAllocMapped);
 cudaHostAlloc(&outputBuffer2_state, sizeof(int), cudaHostAllocMapped);
 cudaHostAlloc(&terminate_thread, sizeof(int), cudaHostAllocMapped);

 cudaStream_t streamk, streamc;
 cudaStreamCreate(&streamk);
 cudaStreamCreate(&streamc);

 *outputBuffer1_state = 0;
 *outputBuffer2_state = 0;
 *terminate_thread = -1;

 int blockwidth = DATA_BUFFER_SIZE;
 int blocks = DATA_BUFFER_SIZE;
 int gridWidth = (blocks+blockwidth-1)/blockwidth;
 dim3 gridDim(gridWidth, NUM_TRANSFORMS);
 printf("Launching kernel with grid=%d,%d, block=%d \n", gridWidth, NUM_TRANSFORMS, blocks);
 doubleBuffer2D<<<gridDim, blocks, 0, streamk>>>(d_buffer1, d_buffer2, outputBuffer1_state, outputBuffer2_state, d_outputList, terminate_thread);

 for (int i = 0; i < 10; i++) {
  printf("========== Iteration %d ==========\n", i);
  int * output_buffer = (i % 2 == 0) ? d_buffer1 : d_buffer2;
  volatile int * buffer_state = (i % 2 == 0) ? outputBuffer1_state : outputBuffer2_state;
  if(i%2==0)printf("Buffer=buffer1, state=%d \n", *buffer_state);
  if(i%2==1)printf("Buffer=buffer2, state=%d \n", *buffer_state);
  printf(" Waiting for release...");
  while (*buffer_state != 0);// printf(" Waiting for release..."); //wait for the device to release the buffer
  printf("\n Released! Buffer state = %d",*buffer_state);
  printf("\n Copying memory...");
  cudaMemcpyAsync(output_buffer, &data[0], sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION, cudaMemcpyHostToDevice, streamc);
  printf("\n Waiting for copy...");
  cudaStreamSynchronize(streamc); //Wait for the copy to be done
  *buffer_state = i; //Release the buffer to the device
  printf("\n Releasing Buffer...\n Buffer state = %d \n",*buffer_state);
  printf("val i=%d \n",i);
 }
 sleep(2);
 *terminate_thread = 1; //release the device
 cudaStreamSynchronize(streamk);
 cudaMemcpy(&h_outputList[0], d_outputList, sizeof(int) * NUM_TRANSFORMS* DATA_BUFFER_SIZE, cudaMemcpyDeviceToHost);
 cudaDeviceSynchronize();

 for(int i=0; i<NUM_TRANSFORMS; i++) {
  for(int j=0; j<DATA_BUFFER_SIZE; j++) {
   int idx = i*DATA_BUFFER_SIZE+j;
   printf("%d, ",h_outputList[idx]);
  }
  printf("\n");
 }
 printf("\n");
 printf("Done \n");
 return 0;
}
$ nvcc -arch=sm_60 -o  t19 t19.cu
t19.cu(20): warning: variable "buffer" was declared but never referenced

t19.cu(20): warning: variable "buffer" was declared but never referenced

$ ./t19
Run
Launching kernel with grid=1,60, block=32
========== Iteration 0 ==========
Buffer=buffer1, state=0
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 0
val i=0
========== Iteration 1 ==========
Buffer=buffer2, state=0
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 1
val i=1
========== Iteration 2 ==========
Buffer=buffer1, state=0
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 2
val i=2
========== Iteration 3 ==========
Buffer=buffer2, state=0
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 3
val i=3
========== Iteration 4 ==========
Buffer=buffer1, state=2
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 4
val i=4
========== Iteration 5 ==========
Buffer=buffer2, state=3
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 5
val i=5
========== Iteration 6 ==========
Buffer=buffer1, state=4
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 6
val i=6
========== Iteration 7 ==========
Buffer=buffer2, state=5
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 7
val i=7
========== Iteration 8 ==========
Buffer=buffer1, state=6
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 8
val i=8
========== Iteration 9 ==========
Buffer=buffer2, state=0
 Waiting for release...
 Released! Buffer state = 0
 Copying memory...
 Waiting for copy...
 Releasing Buffer...
 Buffer state = 9
val i=9
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,

Done
$

As an additional “aside”, it’s always good to test your code with cuda-memcheck anyway. In this scenario, cuda-memcheck also provides additional block-level scheduling stress, because it instruments the GPU for extra error checking in a way that affects block scheduling behavior, so it may help expose block-level races (although it is not guaranteed to do so). Running your original code under cuda-memcheck results in much wider deviations above 9, because of the effect it has on block scheduling order. So it’s a good additional, easy test in this case, but a clean run is not a conclusive statement that there are no block-level race conditions.
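
For example, with the t19 binary built above:

$ cuda-memcheck ./t19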

Hi @txbob,

First of all, my heartfelt thanks for all the effort you clearly put into explaining this for me. Your answer was clear, informative, helpful and above all helped me learn a lot about how to think about the way the GPU organises threads :)

So one thing I would change about the solution you posted: I don’t like the sleep() we had to put in, but as you said, I guess that’s the trivial solution. So I’ve just put in another flag, “thread_terminated”, like so:

#include <stdio.h>
#include <unistd.h>

#define DATA_BUFFER_SIZE 32
#define EVENT_DIMENSION 4
#define NUM_TRANSFORMS 60

__device__ volatile int blockcounter1 = 0;
__device__ volatile int blockcounter2 = 0;

__global__ void doubleBuffer2D(int* buffer1, int* buffer2,
		volatile int* bufferstate1, volatile int* bufferstate2, int* outputList,
		volatile int* terminateFlag, volatile int* thread_terminated) {
	int transformID = blockIdx.y;
	int eventID = blockIdx.x * blockDim.x + threadIdx.x;
	int outIdx = transformID * DATA_BUFFER_SIZE + eventID;

	if (eventID < DATA_BUFFER_SIZE) {
		int iteration = 1;
		while (*terminateFlag != 1) {
			volatile int * state =
					(iteration % 2 == 0) ? bufferstate1 : bufferstate2;
			int * buffer = (iteration % 2 == 0) ? buffer1 : buffer2;
			volatile int * blockcounter =
					(iteration % 2 == 0) ? &blockcounter1 : &blockcounter2;

			while (*state != iteration)
				if (*terminateFlag == 1)
					return; //Wait until the host releases the buffer
			//Do something ======
			outputList[outIdx] = iteration;
			//Finish doing something =====

			__syncthreads(); // wait for my block to finish
			if (!threadIdx.x)
				atomicAdd((int *) blockcounter, 1); //Mark my block as finished
			__threadfence(); //make sure everyone can see my atomicAdd before proceeding
			if (transformID == 0 && eventID == 0) { // I'm the master thread!
				while (*blockcounter < NUM_TRANSFORMS)
					; //Wait for everyone to be done.
				*thread_terminated=iteration;
				*blockcounter = 0;
				*state = 0; //Release the buffer back to the host
			}
			iteration++;
		}
	}

}

int main() {
	printf("Run \n");

	int * data = (int*) malloc(DATA_BUFFER_SIZE * sizeof(int));

	int * d_outputList;
	int * d_buffer1;
	int * d_buffer2;
	volatile int * outputBuffer1_state; //when the state=0, the host is claiming the buffer, if state=1 the device is claiming the buffer, if state=-1 the buffer is released
	volatile int * outputBuffer2_state;
	volatile int * terminate_thread;
	volatile int * thread_terminated;
	int h_outputList[NUM_TRANSFORMS * DATA_BUFFER_SIZE * EVENT_DIMENSION];

	cudaMalloc(&d_outputList, sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE);
	cudaMalloc(&d_buffer1, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaMalloc(&d_buffer2, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaHostAlloc(&outputBuffer1_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&outputBuffer2_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&terminate_thread, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&thread_terminated, sizeof(int), cudaHostAllocMapped);

	cudaStream_t streamk, streamc;
	cudaStreamCreate(&streamk);
	cudaStreamCreate(&streamc);

	*outputBuffer1_state = 0;
	*outputBuffer2_state = 0;
	*terminate_thread = -1;
	*thread_terminated=0;

	int blockwidth = DATA_BUFFER_SIZE;
	int blocks = DATA_BUFFER_SIZE;
	int gridWidth = (blocks + blockwidth - 1) / blockwidth;
	dim3 gridDim(gridWidth, NUM_TRANSFORMS);
	printf("Launching kernel with grid=%d,%d, block=%d \n", gridWidth,
			NUM_TRANSFORMS, blocks);
	doubleBuffer2D<<<gridDim, blocks, 0, streamk>>>(d_buffer1, d_buffer2,
			outputBuffer1_state, outputBuffer2_state, d_outputList,
			terminate_thread, thread_terminated);
	int iteration=0;
	for (int i = 0; i < 10; i++) {
		printf("========== Iteration %d ==========\n", i);
		int * output_buffer = (i % 2 == 0) ? d_buffer1 : d_buffer2;
		volatile int * buffer_state =
				(i % 2 == 0) ? outputBuffer1_state : outputBuffer2_state;
		if (i % 2 == 0)
			printf("Buffer=buffer1, state=%d \n", *buffer_state);
		if (i % 2 == 1)
			printf("Buffer=buffer2, state=%d \n", *buffer_state);
		printf(" Waiting for release...");
		while (*buffer_state != 0); // printf(" Waiting for release..."); //wait for the device to release the buffer
		printf("\n Released! Buffer state = %d", *buffer_state);
		printf("\n Copying memory...");
		cudaMemcpyAsync(output_buffer, &data[0],
				sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION,
				cudaMemcpyHostToDevice, streamc);
		printf("\n Waiting for copy...");
		cudaStreamSynchronize(streamc); //Wait for the copy to be done
		*buffer_state = i; //Release the buffer to the device
		printf("\n Releasing Buffer...\n Buffer state = %d \n", *buffer_state);
		printf("val i=%d \n", i);
		iteration=i;
	}
	while(*thread_terminated!=iteration);
	*terminate_thread = 1; //release the device
	cudaStreamSynchronize(streamk);
	cudaMemcpy(&h_outputList[0], d_outputList,
			sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE,
			cudaMemcpyDeviceToHost);
	cudaDeviceSynchronize();

	for (int i = 0; i < NUM_TRANSFORMS; i++) {
		for (int j = 0; j < DATA_BUFFER_SIZE; j++) {
			int idx = i * DATA_BUFFER_SIZE + j;
			printf("%d, ", h_outputList[idx]);
		}
		printf("\n");
	}
	printf("\n");
	printf("Done \n");
	return 0;
}

This just makes the host wait for the master thread to complete its tasks, which it cannot do unless all the other threads have finished too, so I guess the blocks are now guaranteed to all terminate after 9 cycles.

Anyways, I’ll look into implementing this in a more structured way and read up on cooperative groups as you suggested. As an aside, since you have clearly seen a lot of CUDA solutions to various problems, how valid do you feel this form of solution is for the kind of problem I described in the original post? I mean, as compared to simply re-launching a kernel each time I receive a new package of data, or something different that I might not even have considered. I’ve had a go at implementing different strategies and have found that on my particular setup this one gives me the fastest processing time in terms of microseconds per datum.

Again, many thanks for your time and thoughtful answers.

Change away. My goal is to identify/explain what is wrong and illuminate one or more possible ways forward. My hope is that with that level of description, you can make forward progress. My expectation is that with a reasonable effort, and the aforementioned illumination, you can construct a much better solution than I can. Using sleep is admittedly a pig. I like your crisp interlock better. I mostly wanted to prove the validity of a claim (about a race/hazard), not to suggest that every line of code I write is the best possible approach. I’m neither that good, that motivated, nor that arrogant (I hope).

Regarding how I feel about this form of solution:

  1. It’s in a category of coding which I refer to as persistent kernel coding. Producer/consumer is an equally valid term. As you’ve seen, this sort of coding rises above “trivial” CUDA familiarity, and so I think it is harder than the most trivial cases to get “right”. But you can certainly do it correctly, and CG provides a nice framework to do it correctly, in a platform-aware approach. When global barriers are involved, I’m not 100% certain you can do it correctly without CG (*).

  2. What approach is better? I don’t know; for that you can benchmark alternative approaches. It might be that this is faster, or it might be that an ordinary kernel-launch loop, in spite of the overheads, is actually better/faster. If they were equivalent, then to avoid the aforementioned hazards and complexity, and for maintainability, I would choose a kernel-launch loop (see the sketch after this list). But if a persistent kernel method benchmarks out to be significantly faster (as you have already said), and you want that extra dose of performance, well, that’s why we have engineers and computer scientists.

  3. @njuffa is a much more experienced programmer than I am, so when it comes to programming philosophy (or nearly anything about programming), I attach more weight to his opinions than my own. There are a lot of other smart cookies that visit these forums occasionally too, so one of them may weigh in.

(*) http://on-demand.gputechconf.com/gtc/2016/presentation/s6673-greg-diamos-persisten-rnns.pdf excerpt: “Disclaimer: global barriers violate the CUDA 7.5 model.” (Greg Diamos == smart cookie)
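
To make item 2 above concrete, here is a minimal sketch of the kernel-launch-loop alternative (hypothetical code: processBuffer stands in for an ordinary kernel that does one iteration’s work, and h_data/bytes are placeholders; nothing here is taken from the listings above):

cudaEvent_t done[2];
cudaEventCreate(&done[0]); cudaEventCreate(&done[1]);
for (int i = 0; i < 10; i++) {
    int b = i % 2;
    int *buf = b ? d_buffer2 : d_buffer1;
    if (i >= 2) cudaEventSynchronize(done[b]);      // the kernel that last read this buffer has finished
    cudaMemcpyAsync(buf, h_data, bytes, cudaMemcpyHostToDevice, streamc);
    cudaStreamSynchronize(streamc);                 // buffer i is now on the device
    processBuffer<<<gridDim, blockDim, 0, streamk>>>(buf, d_outputList, i);
    cudaEventRecord(done[b], streamk);              // mark when this kernel is done with buf
}
cudaStreamSynchronize(streamk);

The copy for iteration i+1 can overlap the kernel for iteration i, so some of the launch overhead hides behind the transfers; whether that beats the persistent kernel is exactly the benchmarking question.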

Don’t hide your light under a bushel; our respective experience and strengths are in different areas, with some overlap. I know next to nothing about synchronization strategies (as discussed here) or multi-GPU, for example. While I have depth in certain areas, I am not much into programming philosophy. What “wisdom” I dispense here is based on having seen some stuff work and other stuff fail (as Sheryl Sandberg said: “All advice is autobiographical”). But some of those observations happened in the distant past and not all of them may be relevant today.

Hi all!

I’ve made the above code work with cooperative groups:

#include <stdio.h>
#include <cooperative_groups.h>
using namespace cooperative_groups;

#define DATA_BUFFER_SIZE 256
#define BLOCK_WIDTH 256
#define EVENT_DIMENSION 4
#define NUM_TRANSFORMS 60

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__device__ volatile int blockcounter1 = 0;
__device__ volatile int blockcounter2 = 0;

__global__ void doubleBuffer2D(int* buffer1, int* buffer2, volatile int* bufferstate1, volatile int* bufferstate2, int* outputList, volatile int* terminateFlag, volatile int* thread_terminated) {
	int transformID = blockIdx.y;
	int eventID = blockIdx.x * blockDim.x + threadIdx.x;
	int outIdx = transformID * DATA_BUFFER_SIZE + eventID;

	if (eventID < DATA_BUFFER_SIZE) {
		int iteration = 1;
		grid_group grid = this_grid();
		while (*terminateFlag != 1) {
			volatile int * state = (iteration % 2 == 0) ? bufferstate1 : bufferstate2;
			int * buffer = (iteration % 2 == 0) ? buffer1 : buffer2;

			//Do something ======
			outputList[outIdx] = iteration;
			//Finish doing something =====

			grid.sync();
			*state = 0;
			*thread_terminated = iteration;
			iteration++;
		}
	}

}

int main() {
	printf("Run \n");

	int * data = (int*) malloc(DATA_BUFFER_SIZE * sizeof(int));
	for (int i = 0; i < DATA_BUFFER_SIZE; i++) {
		data[i] = i;
	}

	int * d_outputList;
	int * d_buffer1;
	int * d_buffer2;
	volatile int * outputBuffer1_state; //when the state=0, the host is claiming the buffer, if state=1 the device is claiming the buffer, if state=-1 the buffer is released
	volatile int * outputBuffer2_state;
	volatile int * terminate_thread;
	volatile int * thread_terminated;
	int h_outputList[NUM_TRANSFORMS * DATA_BUFFER_SIZE * EVENT_DIMENSION];

	cudaMalloc(&d_outputList, sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE);
	cudaMalloc(&d_buffer1, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaMalloc(&d_buffer2, (sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION));
	cudaHostAlloc(&outputBuffer1_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&outputBuffer2_state, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&terminate_thread, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&thread_terminated, sizeof(int), cudaHostAllocMapped);

	cudaStream_t streamk, streamc;
	cudaStreamCreate(&streamk);
	cudaStreamCreate(&streamc);

	*outputBuffer1_state = 0;
	*outputBuffer2_state = 0;
	*terminate_thread = -1;
	*thread_terminated = 0;

	dim3 blockDim(BLOCK_WIDTH);
	int gridWidth = (DATA_BUFFER_SIZE + BLOCK_WIDTH - 1) / BLOCK_WIDTH;
	dim3 gridDim(gridWidth, NUM_TRANSFORMS);
	printf("Launching kernel with grid=%d,%d, block=%d \n", gridWidth,
	NUM_TRANSFORMS, BLOCK_WIDTH);

	void * doubleBufferParams[] = {&d_buffer1, &d_buffer2, &outputBuffer1_state, &outputBuffer2_state, &d_outputList, &terminate_thread, &thread_terminated};
	cudaLaunchCooperativeKernel((void*) doubleBuffer2D, gridDim, blockDim, doubleBufferParams, 0, streamk);
	cudaCheckErrors("LaunchCoop failed \n");

	int iteration = 0;
	for (int i = 0; i < 10; i++) {
		printf("========== Iteration %d ==========\n", i);
		int * output_buffer = (i % 2 == 0) ? d_buffer1 : d_buffer2;
		volatile int * buffer_state =
				(i % 2 == 0) ? outputBuffer1_state : outputBuffer2_state;
		if (i % 2 == 0)
			printf("Buffer=buffer1, state=%d \n", *buffer_state);
		if (i % 2 == 1)
			printf("Buffer=buffer2, state=%d \n", *buffer_state);
		printf(" Waiting for release...");
		while (*buffer_state != 0); //wait for the device to release the buffer
		printf("\n Released! Buffer state = %d", *buffer_state);
		printf("\n Copying memory...");
		cudaMemcpyAsync(output_buffer, &data[0],
				sizeof(int) * DATA_BUFFER_SIZE * EVENT_DIMENSION,
				cudaMemcpyHostToDevice, streamc);
		printf("\n Waiting for copy...");
		cudaStreamSynchronize(streamc); //Wait for the copy to be done
		*buffer_state = i; //Release the buffer to the device
		printf("\n Releasing Buffer...\n Buffer state = %d \n", *buffer_state);
		printf("val i=%d \n", i);
		iteration = i;
	}
	while (*thread_terminated != iteration);
	*terminate_thread = 1; //release the device
	cudaStreamSynchronize(streamk);
	cudaDeviceSynchronize();
	cudaMemcpy(&h_outputList[0], d_outputList, sizeof(int) * NUM_TRANSFORMS * DATA_BUFFER_SIZE,	cudaMemcpyDeviceToHost);

	for (int i = 0; i < NUM_TRANSFORMS; i++) {
		for (int j = 0; j < DATA_BUFFER_SIZE; j++) {
			int idx = i * DATA_BUFFER_SIZE + j;
			printf("%d, ", h_outputList[idx]);
		}
		printf("\n");
	}
	printf("\n");
	printf("Done \n");

	free(data);
	cudaFree(d_outputList);
	cudaFree(d_buffer1);
	cudaFree(d_buffer2);
	return 0;
}

It has certainly simplified my kernel code a lot! However, I am now struggling with another issue; if I make (DATA_BUFFER_SIZE * BLOCK_WIDTH * NUM_TRANSFORMS) > 4194304, the program fails with

Fatal error: LaunchCoop failed 
 (too many blocks in cooperative launch at ../src/CudaTutorial.cu:92)
*** FAILED - ABORTING

For example, this issue will happen with the configuration:

#define DATA_BUFFER_SIZE 256
#define BLOCK_WIDTH 256
#define NUM_TRANSFORMS 65

If I make NUM_TRANSFORMS 64, no problem… I guess this has something to do with not all of the blocks fitting on the streaming multiprocessors at once? Or, in the words of @txbob: “The proper design paradigm to work around this is to ensure that only as many blocks as the carrying capacity of the GPU are launched, and the aforementioned CG cooperative grid methods employ this.” Where in my GPU statistics can I actually find out what the carrying capacity of my GeForce GTX 1070 (Max-Q) is?

Also, here are my GPU “stats”:

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070 with Max-Q Design"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
  GPU Max Clock rate:                            1266 MHz (1.27 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Study the CUDA sample code I mentioned:

http://docs.nvidia.com/cuda/cuda-samples/index.html#new-features-in-cuda-toolkit-9-0

6_Advanced/reductionMultiBlockCG. Demonstrates single pass reduction using Multi Block Cooperative Groups.

It uses the CUDA occupancy API to determine the proper number of blocks to launch, so as to satisfy the requirements of the cooperative grid launch. Yes, if you attempt to launch more blocks than can be simultaneously resident on the GPU, you have violated those requirements, and that is the basis for the error.
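
The relevant calculation looks roughly like this (a sketch adapted from that sample, not quoted verbatim, using the kernel and block size from your CG version):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocksPerSm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, doubleBuffer2D, BLOCK_WIDTH, 0 /*dynamic shared mem*/);
int maxCoopBlocks = blocksPerSm * prop.multiProcessorCount;
printf("carrying capacity: %d blocks (%d per SM x %d SMs)\n", maxCoopBlocks, blocksPerSm, prop.multiProcessorCount);
// a cooperative launch must satisfy gridDim.x * gridDim.y * gridDim.z <= maxCoopBlocks

On the 16-SM GTX 1070 you listed, a 64-block ceiling would correspond to 4 resident blocks per SM for this kernel; the exact per-SM figure depends on the kernel’s register and shared memory usage, which is why it has to be queried rather than read off a spec sheet.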