Can one force two operations to occur atomically together?

Greetings.

I’m currently writing a kernel that returns an unknown number of int results to an array in global memory that I’ve worked out to be sufficiently large.

I’m trying to get the kernel to ensure each result goes to a unique position in the array: each thread reads a global arraycounter value into a local variable, writes its current result to global_array[arraycounter], and then atomically increments arraycounter.

Obviously this doesn’t work, as there is no guarantee that each thread reads the counter between increments; two threads can easily read the same value before either one increments it.
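
In code, the pattern described above would look roughly like this (names such as MAX_RESULTS and my_result are placeholders, and this version is shown only to illustrate the race):

__device__ int arraycounter = 0;
__device__ int global_array[MAX_RESULTS];  // MAX_RESULTS: hypothetical capacity

// inside the kernel -- BROKEN, shown only to illustrate the race:
int idx = arraycounter;          // plain read: many threads can see the same value
global_array[idx] = my_result;   // so several threads write to the same slot
atomicAdd(&arraycounter, 1);     // the increment is atomic, but it comes too late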

So I’m left wondering: is there any way for a thread to receive the value of a variable it is atomically incrementing before any other thread gets a chance to increment it again?

Use the return value of the atomicAdd function. It returns the value that was in the location immediately prior to the atomic operation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd

For example, suppose I have a global array like this:

__device__ int next_idx = 0;
__device__ float global_data[MAX_DSIZE];

Then in my kernel code, suppose a thread decides that it needs to put 3 items (float quantities) into the next available slots in global_data. It could do something like this:

int my_idx = atomicAdd(&next_idx, 3);   // reserve 3 slots; my_idx is the old value
if ((my_idx + 2) < MAX_DSIZE) {         // make sure all 3 slots fit in the buffer
  global_data[my_idx++] = my_local_float_0;
  global_data[my_idx++] = my_local_float_1;
  global_data[my_idx++] = my_local_float_2;
}

Those 3 positions in global_data are guaranteed to be “reserved” for use by that thread. There is no possibility of any other thread stepping on them, as long as every thread uses the same mechanism. There’s nothing magical about float here: you could change global_data to an int array and store int quantities with the same mechanism. There’s also no particular reason the buffer has to be a __device__ variable; you can use a buffer allocated with cudaMalloc in the same fashion. Using __device__ variables here just simplifies the presentation.
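
For completeness, here is a minimal sketch of what the cudaMalloc variant might look like for one int result per thread. All the names here (collect_results, d_out, d_counter, MAX_N) are hypothetical, and the “computation” is just a stand-in:

#include <cstdio>

__global__ void collect_results(int *out, int *counter, int max_n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int my_result = tid * tid;              // stand-in for a real computation
  if (tid % 2 == 0) {                     // only some threads produce a result
    int my_idx = atomicAdd(counter, 1);   // reserve exactly one slot
    if (my_idx < max_n)
      out[my_idx] = my_result;            // no other thread got this my_idx
  }
}

int main()
{
  const int MAX_N = 1024;
  int *d_out, *d_counter;
  cudaMalloc(&d_out, MAX_N * sizeof(int));
  cudaMalloc(&d_counter, sizeof(int));
  cudaMemset(d_counter, 0, sizeof(int));
  collect_results<<<4, 256>>>(d_out, d_counter, MAX_N);
  int n = 0;
  cudaMemcpy(&n, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
  printf("results written: %d\n", n < MAX_N ? n : MAX_N);
  cudaFree(d_out);
  cudaFree(d_counter);
  return 0;
}

Reading the counter back at the end also tells the host how many results were actually produced, which answers the “unknown number of results” part of the question.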

Thanks. I’m trying to compare searching a 3D kd-tree on a CPU versus a GPU, and this makes finishing a range search on the GPU much easier.