Postfix expression evaluation on GPU

Hello, everyone

I am implementing a CUDA kernel to evaluate a postfix expression on my GTX Titan. The kernel takes the expression, a left series, and a right series of data as input, then evaluates the expression on every pair of left and right elements by calling the device function below. Basically, this function allocates an array (which I call ‘stack’) and uses it as a stack for intermediate values. It iterates over the expression nodes and checks each node's type: if the node is a value, the function fetches the real value from the input data and pushes it onto the stack; if the node is an operator, it pops the top values from the stack, evaluates the operator, and pushes the result back onto the stack. The final result ends up at stack[0].

typedef struct {
	ExpType type;	//type of expression node, could be operand, or specific operators such as add, subtract, multiply, etc
	int idx; 	//0 if left operand and 1 if right operand
	...
} ExpNode;

class Val {
public: 
	Val operator(Val &right){...};
private:
	int16_t value;
	ValType type;	//int8_t, int16_t, double, float, etc
};

__device__ Val evaluate(ExpNode *expression,
				int size,
				Val left_op,
				Val right_op)
{
	Val stack[MAX_STACK_SIZE];
	Val *stack_ptr = stack;
	Val left, right;

	for (int i = 0; i < size; i++) {
		switch (expression[i].type) {
			case OPERAND:
				if (expression[i].idx == 0)
					*stack_ptr++ = left_op;
				else
					*stack_ptr++ = right_op;
				break;
			case OPERATOR: {
				//Pop top values from the stack
				right = *--stack_ptr;
				left = *--stack_ptr;
				//Evaluate and push back result to the stack
				*stack_ptr++ = left.operator(right);
				break;
			}
			...
		}
	}

	return *--stack_ptr;
}

__global__ void EVAL(ExpNode *expression,
			int size,
			Val *left_series,
			Val *right_series,
			Val *output,
			...)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	output[idx] = evaluate(expression, size, left_series[idx], right_series[idx]);
...
}

This is the problem: when I increase the stack size (by changing MAX_STACK_SIZE), the kernel's execution time increases as well. I expected the execution time to stay the same, because I used the same expression and input data in all experiments.
Here are some numbers measured in my previous experiments (msecs means milliseconds). As you can see, the execution time increases quickly when MAX_STACK_SIZE goes above 32.

MAX_STACK_SIZE = 4: 137.5 msecs
     [exec] ptxas info    : Compiling entry function 'EVAL' for 'sm_52'
     [exec] ptxas info    : Function properties for EVAL
     [exec]     320 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
     [exec] ptxas info    : Used 47 registers, 448 bytes cmem[0], 64 bytes cmem[2]

MAX_STACK_SIZE = 8: 154.7 msecs
     [exec] ptxas info    : Compiling entry function 'EVAL' for 'sm_52'
     [exec] ptxas info    : Function properties for EVAL
     [exec]     384 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
     [exec] ptxas info    : Used 47 registers, 448 bytes cmem[0], 64 bytes cmem[2]

MAX_STACK_SIZE = 16: 143.9 msecs
     [exec] ptxas info    : Compiling entry function 'EVAL' for 'sm_52'
     [exec] ptxas info    : Function properties for EVAL
     [exec]     512 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
     [exec] ptxas info    : Used 47 registers, 448 bytes cmem[0], 64 bytes cmem[2]

MAX_STACK_SIZE = 32: 189.2 msecs
     [exec] ptxas info    : Compiling entry function 'EVAL' for 'sm_52'
     [exec] ptxas info    : Function properties for EVAL
     [exec]     768 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
     [exec] ptxas info    : Used 47 registers, 448 bytes cmem[0], 64 bytes cmem[2]

MAX_STACK_SIZE = 64: 285.4 msecs

MAX_STACK_SIZE = 128: 489.6 msecs    
     [exec] ptxas info    : Compiling entry function 'EVAL' for 'sm_52'
     [exec] ptxas info    : Function properties for EVAL
     [exec]     2304 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
     [exec] ptxas info    : Used 47 registers, 448 bytes cmem[0], 64 bytes cmem[2]

Do you have any idea why this situation happened? Any help or suggestions would be appreciated.

Perhaps there is some of your code that you haven’t shown that depends on MAX_STACK_SIZE.

The suggestion I have is that you provide a short, complete code that demonstrates the issue.

Alternatively, I don’t see any reason why you need to make this a local allocation. Perform a cudaMalloc before launching the kernel, and pass the base “stack” area to the kernel. Each thread can then carve out its own individual stack.
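
For illustration, a minimal sketch of that suggestion (names are made up; how the pool is carved up - one contiguous chunk per thread vs. interleaved - is an open choice here and comes up again later in this thread):

//sketch only: one global pool holding MAX_STACK_SIZE entries per thread
Val *stack_pool;
size_t total_threads = (size_t)num_blocks * threads_per_block;
cudaMalloc(&stack_pool, total_threads * MAX_STACK_SIZE * sizeof(Val));
EVAL<<<num_blocks, threads_per_block>>>(expression, size, left_series,
                                        right_series, output, stack_pool);

//inside the kernel, e.g. a contiguous chunk per thread:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Val *my_stack = stack_pool + (size_t)idx * MAX_STACK_SIZE;
//...then pass my_stack to evaluate() and use it in place of the local array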

@txbob
Thanks for your suggestion. Here is my sample code. I have not tested it on a GTX Titan, but on my GTX 970 the same problem occurs: when I increase MAX_STACK_SIZE, the execution time gets longer.

main.cu

//main.cu
#include "value.h"
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>


__device__ Value evaluate(ExpNode *exp, int exp_size, Value left, Value right)
{
	Value stack[MAX_STACK_SIZE];
	Value *stack_ptr = stack;
	Value ltemp, rtemp;

	for (int i = 0; i < exp_size; i++) {
		switch (exp[i].type) {
		case EXP_VALUE: {
			if (exp[i].idx == 0) {
				*stack_ptr++ = left;
			} else {
				*stack_ptr++ = right;
			}
			break;
		}
		case EXP_AND: {
			ltemp = *--stack_ptr;
			rtemp = *--stack_ptr;
			*stack_ptr++ = ltemp.log_and(rtemp);
			break;
		}
		case EXP_OR: {
			ltemp = *--stack_ptr;
			rtemp = *--stack_ptr;
			*stack_ptr++ = ltemp.log_or(rtemp);
			break;
		}
		case EXP_EQUAL: {
			ltemp = *--stack_ptr;
			rtemp = *--stack_ptr;
			*stack_ptr++ = ltemp.equal(rtemp);
			break;
		}
		case EXP_GREATER: {
			ltemp = *--stack_ptr;
			rtemp = *--stack_ptr;
			*stack_ptr++ = ltemp.greater_than(rtemp);
			break;
		}
		case EXP_LESS: {
			ltemp = *--stack_ptr;
			rtemp = *--stack_ptr;
			*stack_ptr++ = ltemp.less_than(rtemp);
			break;
		}
		default:
			return Value::getFalse();
		}
	}

	return stack[0];
}

__global__ void exp_evaluate(ExpNode *exp,
				int exp_size,
				Value *left,
				Value *right,
				int *output,
				int *loop_num,
				int size)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int i = 0;
	Value result;
	int count = 0;

	while (i < loop_num[x]) {
		result = evaluate(exp, exp_size, left[x], right[x]);
		count += (result.isTrue()) ? 1 : 0;
		i++;
	}
	//__syncthreads();
	output[x] = count;
}

int main()
{
	Value *left, *right;
	int *loop_num;

	left = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
	if (left == NULL) {
		printf("Failed to allocate left array. \n");
		return -1;
	}
	right = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
	if (right == NULL) {
		printf("Failed to allocate right array. \n");
		return -1;
	}
	loop_num = (int *)malloc(sizeof(int) * MAX_ARRAY_SIZE);
	if (loop_num == NULL) {
		printf("Failed to allocate loop_num array. \n");
		return -1;
	}

	srand(time(NULL));


	//Manually set values for left and right arrays
	for (int i = 0; i < MAX_ARRAY_SIZE; i++) {
		left[i].setData(1, TYPE_INTEGER);
		right[i].setData(1, TYPE_INTEGER);
		loop_num[i] = rand() % 20 + 1;
	}

	cudaError_t res;
	Value *cu_left, *cu_right;
	int *cu_loop_num, *cu_output, output[MAX_ARRAY_SIZE];

	checkCudaErrors(cudaMalloc(&cu_left, MAX_ARRAY_SIZE * sizeof(Value)));
	checkCudaErrors(cudaMalloc(&cu_right, MAX_ARRAY_SIZE * sizeof(Value)));
	checkCudaErrors(cudaMalloc(&cu_loop_num, MAX_ARRAY_SIZE * sizeof(int)));
	checkCudaErrors(cudaMalloc(&cu_output, MAX_ARRAY_SIZE * sizeof(int)));

	checkCudaErrors(cudaMemcpy(cu_left, left, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(cu_right, right, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(cu_loop_num, loop_num, MAX_ARRAY_SIZE * sizeof(int), cudaMemcpyHostToDevice));

	ExpNode expression[11];
	ExpNode *cu_expression;

	//Manually create post-fix expression
	expression[0].idx = 0;
	expression[0].type = EXP_VALUE;
	expression[1].idx = 1;
	expression[1].type = EXP_VALUE;
	expression[2].type = EXP_EQUAL;
	expression[3].idx = 0;
	expression[3].type = EXP_VALUE;
	expression[4].idx = 1;
	expression[4].type = EXP_VALUE;
	expression[5].type = EXP_GREATER;
	expression[6].type = EXP_OR;
	expression[7].idx = 0;
	expression[7].type = EXP_VALUE;
	expression[8].idx = 1;
	expression[8].type = EXP_VALUE;
	expression[9].type = EXP_LESS;
	expression[10].type = EXP_OR;

	checkCudaErrors(cudaMalloc(&cu_expression, 11 * sizeof(ExpNode)));
	checkCudaErrors(cudaMemcpy(cu_expression, expression, 11 * sizeof(ExpNode), cudaMemcpyHostToDevice));

	dim3 grid_size(1024, 1, 1);
	dim3 block_size(1024, 1, 1);

	struct timeval start, end;

	gettimeofday(&start, NULL);
	exp_evaluate<<<grid_size, block_size>>>(cu_expression, 11, cu_left, cu_right, cu_output, cu_loop_num, MAX_STACK_SIZE);
	res = cudaGetLastError();
	if (res != cudaSuccess) {
		printf("Error: kernel launch failed. Error code: %s\n", cudaGetErrorString(res));
		return -1;
	}

	checkCudaErrors(cudaDeviceSynchronize());
	gettimeofday(&end, NULL);

	printf("Execution time: %lu usec\n", (end.tv_sec - start.tv_sec) * 1000000 + end.tv_usec - start.tv_usec);


	checkCudaErrors(cudaMemcpy(output, cu_output, MAX_ARRAY_SIZE * sizeof(int), cudaMemcpyDeviceToHost));

	checkCudaErrors(cudaFree(cu_left));
	checkCudaErrors(cudaFree(cu_right));
	checkCudaErrors(cudaFree(cu_output));
	checkCudaErrors(cudaFree(cu_loop_num));
	checkCudaErrors(cudaFree(cu_expression));
	free(left);
	free(right);
	free(loop_num);
	printf("End\n");

	return 0;
}

value.h

#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

#ifndef MAX_STACK_SIZE
#define MAX_STACK_SIZE 4
#endif
#define MAX_ARRAY_SIZE (1024 * 1024)

typedef enum {
	EXP_VALUE = 0,
	EXP_AND,
	EXP_OR,
	EXP_EQUAL,
	EXP_GREATER,
	EXP_LESS
} ExpType;

typedef struct {
	ExpType type;
	int idx;		//0: left array, 1: right array
} ExpNode;

typedef enum {
    TYPE_INVALID      = 0,
    TYPE_NULL,
    TYPE_BOOLEAN,
    TYPE_INTEGER,
    TYPE_DOUBLE
} ValType;

class Value {
public:
	__host__ __device__ Value(): value_(0), type_(TYPE_NULL) {}
	__host__ __device__ Value(int64_t val, ValType val_type): value_(val), type_(val_type) {}

	inline __device__ Value equal(Value &h) {
		int64_t h_val = h.getValue();
		return (value_ == h_val) ? getTrue() : getFalse();
	}

	inline __device__ Value greater_than(Value &h) {
		int64_t h_val = h.getValue();
		return (value_ > h_val) ? getTrue() : getFalse();
	}

	inline __device__ Value less_than(Value &h) {
		int64_t h_val = h.getValue();
		return (value_ < h_val) ? getTrue() : getFalse();
	}

	inline __device__ Value log_and(Value &h) {
		bool h_val = (bool)(h.getValue());
		return ((bool)value_ && h_val) ? getTrue() : getFalse();
	}

	inline __device__ Value log_or(Value &h) {
		bool h_val = (bool)(h.getValue());
		return ((bool)value_ || h_val) ? getTrue() : getFalse();
	}

	static __device__ Value getTrue() {
		return Value((int64_t)true, TYPE_BOOLEAN);
	}

	static __device__ Value getFalse() {
		return Value((int64_t)false, TYPE_BOOLEAN);
	}

	inline __host__ __device__ void setData(int64_t val, ValType val_type) {
		value_ = val;
		type_ = val_type;
	}

	__device__ bool isTrue() {
		return ((bool)value_);
	}

	__device__ bool isFalse() {
		return !isTrue();
	}

	__device__ int64_t getValue() {
		return value_;
	}

	__device__ ValType getValueType() {
		return type_;
	}
private:
	int64_t value_;
	ValType type_;
};

When I used this code on my GTX 970, the execution time was around 664 msecs with MAX_STACK_SIZE = 4 and increased to 2937 msecs with MAX_STACK_SIZE = 256.
Btw, I also tried allocating the ‘stack’ array with cudaMalloc, but the result was still the same.

Interesting.

The metrics I would focus on are:

$ nvcc -arch=sm_52 -Xptxas -v -DMAX_STACK_SIZE=4 -I/usr/local/cuda/samples/common/inc -o test main.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i' for 'sm_52'
ptxas info    : Function properties for _Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 26 registers, 372 bytes cmem[0], 32 bytes cmem[2]
$ nvprof --metrics local_store_transactions,local_load_transactions,dram_read_transactions,dram_write_transactions,l2_utilization ./test
==2660== NVPROF is profiling process 2660, command: ./test
==2660== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==2660== Replaying kernel "exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)" (done)
Execution time: 132677 usec
End
==2660== Profiling application: ./test
==2660== Profiling result:
==2660== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 960 (0)"
        Kernel: exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)
          1                  local_store_transactions                  Local Store Transactions   107656379   107656379   107656379
          1                   local_load_transactions                   Local Load Transactions    95126884    95126884    95126884
          1                    dram_read_transactions           Device Memory Read Transactions     1319005     1319005     1319005
          1                   dram_write_transactions          Device Memory Write Transactions    49669178    49669178    49669178
          1                            l2_utilization                      L2 Cache Utilization    Max (10)    Max (10)    Max (10)
$ nvcc -arch=sm_52 -Xptxas -v -DMAX_STACK_SIZE=256 -I/usr/local/cuda/samples/common/inc -o test main.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i' for 'sm_52'
ptxas info    : Function properties for _Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i
    4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 372 bytes cmem[0], 24 bytes cmem[2]
$ nvprof --metrics local_store_transactions,local_load_transactions,dram_read_transactions,dram_write_transactions,l2_utilization ./test
==2718== NVPROF is profiling process 2718, command: ./test
==2718== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Execution time: 1908386 usec
==2718== Replaying kernel "exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)" (done)
End
==2718== Profiling application: ./test
==2718== Profiling result:
==2718== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX 960 (0)"
        Kernel: exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)
          1                  local_store_transactions                  Local Store Transactions  1915478518  1915478518  1915478518
          1                   local_load_transactions                   Local Load Transactions    95062058    95062058    95062058
          1                    dram_read_transactions           Device Memory Read Transactions    34390085    34390085    34390085
          1                   dram_write_transactions          Device Memory Write Transactions  1911826037  1911826037  1911826037
          1                            l2_utilization                      L2 Cache Utilization     Mid (5)     Mid (5)     Mid (5)
$

We see that in the MAX_STACK_SIZE = 256 case, the local store transactions increase by 20x, the local load transactions remain about the same, the dram read transactions increase by 25x, and the dram write transactions increase by 40x.

Bucking the trend, the L2 utilization (effectively a measure of average L2 bandwidth) goes down in the 256 stack size case.

So what we have here is a local memory organization that in one case (4) is more L2-friendly, and in the other case (256) is less L2-friendly. The reason it is less L2 friendly is that it is spending more time in dram latency hell (i.e. this is why the L2 utilization metric oddly goes down).

We can get some clues about local memory organization from the programming guide. First, let's remind ourselves how local memory is organized:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

“Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs.”

Now, what's meant by thread ID specifically? That refers back to section 2.2:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

“The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same;…”

So we see that organization by thread ID implies a block-level organization, not a grid-level organization.

Suppose I had a grid-level organization, with threadblocks of 4 threads each, and a stack size of 3:

TB0:     TB1:     TB2:
0 1 2 3  4 5 6 7  8 9 10 11 ...
0 1 2 3  4 5 6 7  8 9 10 11 ...
0 1 2 3  4 5 6 7  8 9 10 11 ...

But instead we have a block-level organization:

0 1 2 3
4 5 6 7
8 9 10 11

0 1 2 3
4 5 6 7
8 9 10 11

0 1 2 3
4 5 6 7
8 9 10 11

...

The takeaway is that in the first case, the size of the stack (the vertical dimension) should not matter - elements at the top of each stack are still grouped together in memory, followed by the next element in each stack – throughout the grid. This is true regardless of stack size.

But in the second case, this is not true. If we increase the stack (the vertical dimension) elements at the top of the stack in TB0 get farther away from elements at the top of the stack in TB1.

This then has implications for the L2 cache, where local loads and stores will attempt to hit before they escape to dram.
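
To make the two layouts concrete, a small illustrative sketch (hypothetical helper names; Value as defined in the code further down):

//contiguous chunk per thread ("vertical"): for the same element k,
//neighboring threads are stack_size * sizeof(Value) bytes apart
__device__ Value *chunked(Value *pool, int tid, int k, int stack_size)
{
	return pool + (size_t)tid * stack_size + k;
}

//interleaved across all threads ("horizontal", like local memory but grid-wide):
//for the same element k, neighboring threads are adjacent, whatever the stack size
__device__ Value *interleaved(Value *pool, int tid, int k, int total_threads)
{
	return pool + (size_t)tid + (size_t)k * total_threads;
}

With the interleaved form, a warp reading the same stack depth touches one contiguous 32 * 16 = 512-byte region no matter how large the stack is; with the chunked form, the footprint of that same access grows with the stack size.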

The size of your stack element is 16 bytes. So a stack size of 4 means 64 bytes per thread, or 64KB per 1024-thread threadblock.

For an L2 cache of ~1792KB on GTX 970, that means that with a stack size of 4, 1792/64 or about 28 threadblocks (out of 1024 total) worth of stack frames could fit in L2 cache at any time.
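
The same arithmetic for the other stack sizes, as a standalone host-side sketch (the ~1792KB L2 size and 16-byte element size are the figures used above):

#include <cstdio>

int main()
{
    const size_t l2_bytes    = 1792 * 1024;   // approx. GTX 970 L2 size
    const size_t elem_bytes  = 16;            // sizeof(Value)
    const size_t blk_threads = 1024;          // threads per block in this code
    const int stack_sizes[]  = {4, 8, 16, 32, 64, 128, 256};

    for (int s : stack_sizes) {
        size_t per_block = s * elem_bytes * blk_threads;   // stack bytes per block
        printf("MAX_STACK_SIZE=%3d : %4zu KB of stack per block, ~%zu blocks' worth fits in L2\n",
               s, per_block / 1024, l2_bytes / per_block);
    }
    return 0;
}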

If we’re only using that much of the larger stack, then with a fully associative cache, there shouldn’t be much difference. But the L2 cache is not fully associative, and you are introducing a fairly large step in the access pattern by doing this. So I think it’s possible that this effectively results in thrashing of the L2 cache (the metrics suggest this to me, anyway) which is resulting in the dramatic increase of total number of transactions to dram, in the slower 256 case.

This is still a pretty hand-wavy explanation, and it definitely does not address all possible questions (like why are the local store transactions so much higher while the local loads are not).

However, I think it should be possible to avoid the block vs. grid organization issue by using global memory for the stack, instead of local memory, since you can organize it any way you want. You mention that you tried this, but you may not have organized it correctly. If you simply carve out a contiguous chunk for each thread, that won't work - you're back into the vertical vs. horizontal trap. I think you want to carve the global allocation in such a way that each thread's stack is interleaved - which is effectively what the local memory organization does - but you will interleave across the grid instead of across the threadblock.

A problem you will run into here is that for a large stack size, this will be a really large allocation - a stack size of 256 with your grid size (1048576 threads) requires an allocation of ~4GB. We can improve on this by determining max instantaneous occupancy (it is 2 threadblocks * number of GPU SMs for this particular code) and using an atomic technique to allow threadblocks to reuse the stack area of other threadblocks that have retired. This would reduce the global memory allocation required down to a very manageable 2 * #SMs * number of threads per block * stack size * 16 (i.e. sizeof(Value)).
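
A sketch of what that atomic slot-reuse could look like (hypothetical names; it assumes at most 2 resident blocks per SM and 13 SMs, per the occupancy noted above, and a stack_pool allocation of MAX_RESIDENT_BLOCKS * blockDim.x * MAX_STACK_SIZE Values):

#define MAX_RESIDENT_BLOCKS (2 * 13)             //2 blocks/SM * 13 SMs (GTX 970 assumption)

__device__ int slot_busy[MAX_RESIDENT_BLOCKS];   //zero-initialized claim flags

__device__ int acquire_slot()
{
	//one thread per block spins until it claims a free stack region
	for (;;) {
		for (int s = 0; s < MAX_RESIDENT_BLOCKS; s++)
			if (atomicCAS(&slot_busy[s], 0, 1) == 0)
				return s;
	}
}

__device__ void release_slot(int s)
{
	__threadfence();                         //make this block's last stack writes visible
	atomicExch(&slot_busy[s], 0);
}

__global__ void exp_evaluate_pooled(/* ...same arguments as before..., */ Value *stack_pool)
{
	__shared__ int slot;
	if (threadIdx.x == 0) slot = acquire_slot();
	__syncthreads();

	//this block's region; thread stacks interleaved with stride blockDim.x,
	//mirroring the local-memory organization but at a bounded total size
	Value *my_stack = stack_pool
	                + (size_t)slot * blockDim.x * MAX_STACK_SIZE
	                + threadIdx.x;
	//... call evaluate() with my_stack and a stack increment of blockDim.x ...

	__syncthreads();
	if (threadIdx.x == 0) release_slot(slot);
}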

Having said all that, this code suggests to me the potential for a lot of divergent thread behavior, which may lead each thread to be accessing different relative elements in the stack. Depending on how this divergence works out in your “real” application, and on the actual stack size, none of this fiddling may matter. You may still end up thrashing the L2 with large-scale disorganized accesses.

To demonstrate that global memory could be used to work around this via reorganization of the underlying data storage pattern, here’s a modification of your code that uses a single global memory allocation for the stack, and each thread has a strided/interleaved stack.

//main2.cu
#include <iostream>
#include <stdio.h>
#include <helper_cuda.h>
#include <helper_functions.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

//#define MAX_STACK_SIZE 4
#define nTPB 1024
#define nBLK 1024
#define MAX_ARRAY_SIZE (nTPB * nBLK)

typedef enum {
        EXP_VALUE = 0,
        EXP_AND,
        EXP_OR,
        EXP_EQUAL,
        EXP_GREATER,
        EXP_LESS
} ExpType;

typedef struct {
        ExpType type;
        int idx;                //0: left array, 1: right array
} ExpNode;

typedef enum {
    TYPE_INVALID      = 0,
    TYPE_NULL,
    TYPE_BOOLEAN,
    TYPE_INTEGER,
    TYPE_DOUBLE
} ValType;

class Value {
public:
        __host__ __device__ Value(): value_(0), type_(TYPE_NULL) {}
        __host__ __device__ Value(int64_t val, ValType val_type): value_(val), type_(val_type) {}

        inline __device__ Value equal(Value &h) {
                int64_t h_val = h.getValue();
                return (value_ == h_val) ? getTrue() : getFalse();
        }

        inline __device__ Value greater_than(Value &h) {
                int64_t h_val = h.getValue();
                return (value_ > h_val) ? getTrue() : getFalse();
        }

        inline __device__ Value less_than(Value &h) {
                int64_t h_val = h.getValue();
                return (value_ < h_val) ? getTrue() : getFalse();
        }

        inline __device__ Value log_and(Value &h) {
                bool h_val = (bool)(h.getValue());
                return ((bool)value_ && h_val) ? getTrue() : getFalse();
        }

        inline __device__ Value log_or(Value &h) {
                bool h_val = (bool)(h.getValue());
                return ((bool)value_ || h_val) ? getTrue() : getFalse();
        }

        static __device__ Value getTrue() {
                return Value((int64_t)true, TYPE_BOOLEAN);
        }

        static __device__ Value getFalse() {
                return Value((int64_t)false, TYPE_BOOLEAN);
        }

        inline __host__ __device__ void setData(int64_t val, ValType val_type) {
                value_ = val;
                type_ = val_type;
        }

        __device__ bool isTrue() {
                return ((bool)value_);
        }

        __device__ bool isFalse() {
                return !isTrue();
        }

        __device__ int64_t getValue() {
                return value_;
        }

        __device__ ValType getValueType() {
                return type_;
        }
private:
        int64_t value_;
        ValType type_;
};

__device__ Value evaluate(ExpNode *exp, int exp_size, Value left, Value right, Value *stack_ptr)
{
        Value ltemp, rtemp, *stack = stack_ptr;
        int incr = MAX_ARRAY_SIZE;
        for (int i = 0; i < exp_size; i++) {
                switch (exp[i].type) {
                case EXP_VALUE: {
                        if (exp[i].idx == 0) {
                                *stack_ptr = left; stack_ptr += incr;
                        } else {
                                *stack_ptr = right; stack_ptr += incr;
                        }
                        break;
                }
                case EXP_AND: {
                        stack_ptr -= incr; ltemp = *stack_ptr;
                        stack_ptr -= incr; rtemp = *stack_ptr;
                        *stack_ptr = ltemp.log_and(rtemp); stack_ptr += incr;
                        break;
                }
                case EXP_OR: {
                        stack_ptr -= incr; ltemp = *stack_ptr;
                        stack_ptr -= incr; rtemp = *stack_ptr;
                        *stack_ptr = ltemp.log_or(rtemp); stack_ptr += incr;
                        break;
                }
                case EXP_EQUAL: {
                        stack_ptr -= incr; ltemp = *stack_ptr;
                        stack_ptr -= incr; rtemp = *stack_ptr;
                        *stack_ptr = ltemp.equal(rtemp); stack_ptr += incr;
                        break;
                }
                case EXP_GREATER: {
                        stack_ptr -= incr; ltemp = *stack_ptr;
                        stack_ptr -= incr; rtemp = *stack_ptr;
                        *stack_ptr = ltemp.greater_than(rtemp); stack_ptr += incr;
                        break;
                }
                case EXP_LESS: {
                        stack_ptr -= incr; ltemp = *stack_ptr;
                        stack_ptr -= incr; rtemp = *stack_ptr;
                        *stack_ptr = ltemp.less_than(rtemp); stack_ptr += incr;
                        break;
                }
                default:
                        return Value::getFalse();
                }
        }

        return stack[0];
}

__global__ void exp_evaluate(ExpNode *exp,
                                int exp_size,
                                Value *left,
                                Value *right,
                                int *output,
                                int *loop_num,
                                Value *my_stack)
{
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int i = 0;
        Value result;
        int count = 0;

        while (i < loop_num[x]) {
                result = evaluate(exp, exp_size, left[x], right[x], my_stack+x);
                count += (result.isTrue()) ? 1 : 0;
                i++;
        }
        //__syncthreads();
        output[x] = count;
}

int main()
{
        Value *left, *right, *sp;
        int *loop_num;

        left = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
        if (left == NULL) {
                printf("Failed to allocate left array. \n");
                return -1;
        }
        right = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
        if (right == NULL) {
                printf("Failed to allocate right array. \n");
                return -1;
        }
        loop_num = (int *)malloc(sizeof(int) * MAX_ARRAY_SIZE);
        if (loop_num == NULL) {
                printf("Failed to allocate loop_num array. \n");
                return -1;
        }

        srand(time(NULL));

//Manually set values for left and right arrays
        for (int i = 0; i < MAX_ARRAY_SIZE; i++) {
                left[i].setData(1, TYPE_INTEGER);
                right[i].setData(1, TYPE_INTEGER);
                loop_num[i] = rand() % 20 + 1;
        }

        cudaError_t res;
        Value *cu_left, *cu_right;
        int *cu_loop_num, *cu_output, output[MAX_ARRAY_SIZE];

        checkCudaErrors(cudaMalloc(&cu_left, MAX_ARRAY_SIZE * sizeof(Value)));
        checkCudaErrors(cudaMalloc(&cu_right, MAX_ARRAY_SIZE * sizeof(Value)));
        checkCudaErrors(cudaMalloc(&cu_loop_num, MAX_ARRAY_SIZE * sizeof(int)));
        checkCudaErrors(cudaMalloc(&cu_output, MAX_ARRAY_SIZE * sizeof(int)));
        checkCudaErrors(cudaMalloc(&sp, MAX_ARRAY_SIZE *MAX_STACK_SIZE * sizeof(Value)));

        checkCudaErrors(cudaMemcpy(cu_left, left, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
        checkCudaErrors(cudaMemcpy(cu_right, right, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
        checkCudaErrors(cudaMemcpy(cu_loop_num, loop_num, MAX_ARRAY_SIZE * sizeof(int), cudaMemcpyHostToDevice));

        ExpNode expression[11];
        ExpNode *cu_expression;

        //Manually create post-fix expression
        expression[0].idx = 0;
        expression[0].type = EXP_VALUE;
        expression[1].idx = 1;
        expression[1].type = EXP_VALUE;
        expression[2].type = EXP_EQUAL;
        expression[3].idx = 0;
        expression[3].type = EXP_VALUE;
        expression[4].idx = 1;
        expression[4].type = EXP_VALUE;
        expression[5].type = EXP_GREATER;
        expression[6].type = EXP_OR;
        expression[7].idx = 0;
        expression[7].type = EXP_VALUE;
        expression[8].idx = 1;
        expression[8].type = EXP_VALUE;
        expression[9].type = EXP_LESS;
        expression[10].type = EXP_OR;

        checkCudaErrors(cudaMalloc(&cu_expression, 11 * sizeof(ExpNode)));
        checkCudaErrors(cudaMemcpy(cu_expression, expression, 11 * sizeof(ExpNode), cudaMemcpyHostToDevice));

        dim3 grid_size(nBLK, 1, 1);
        dim3 block_size(nTPB, 1, 1);

        struct timeval start, end;

        gettimeofday(&start, NULL);
        exp_evaluate<<<grid_size, block_size>>>(cu_expression, 11, cu_left, cu_right, cu_output, cu_loop_num, sp);
        res = cudaGetLastError();
        if (res != cudaSuccess) {
                printf("Error: kernel launch failed. Error code: %s\n", cudaGetErrorString(res));
                return -1;
        }

        checkCudaErrors(cudaDeviceSynchronize());
        gettimeofday(&end, NULL);

        printf("Execution time: %lu usec\n", (end.tv_sec - start.tv_sec) * 1000000 + end.tv_usec - start.tv_usec);

checkCudaErrors(cudaMemcpy(output, cu_output, MAX_ARRAY_SIZE * sizeof(int), cudaMemcpyDeviceToHost));

        checkCudaErrors(cudaFree(cu_left));
        checkCudaErrors(cudaFree(cu_right));
        checkCudaErrors(cudaFree(cu_output));
        checkCudaErrors(cudaFree(cu_loop_num));
        checkCudaErrors(cudaFree(cu_expression));
        free(left);
        free(right);
        free(loop_num);
        printf("End\n");

        return 0;
}

When I compile and run this code, the execution time does not vary between a stack size of 4 and a stack size of 64:

$ nvcc -arch=sm_52 -I/usr/local/cuda/samples/common/inc -DMAX_STACK_SIZE=4 main2.cu -o test2
$ ./test2
Execution time: 41047 usec
End
$ nvcc -arch=sm_52 -I/usr/local/cuda/samples/common/inc -DMAX_STACK_SIZE=64 main2.cu -o test2
$ ./test2
Execution time: 41017 usec
End
$
  1. We can check local vars layout by taking their addresses

  2. “persistent threads” approach seems applicable here - instead of picking up free resources, pick up the jobs

  3. The stack var seems like a great candidate for shared memory - it will automatically limit occupancy, avoid meaningless flushing to L2$/DRAM, and improve performance due to banked instead of coalesced access. This way we can avoid atomics completely

  4. The evaluation cycle may be optimized for warp-wide execution in this way (though it will only be useful if each thread evaluates its own expression):

Val *p = stack+1; //stack ptr, reserve one cell
...
for(...)
{
  Val left = p[-1], right = p[0];  // fetch stack values prior to the switch
  switch(operation)
  {...}  // adjust p according to the number of values consumed
  p[0] = result;  // store result after the switch
}
  5. Also, what's the meaning of “while (i < loop_num[x])”? The result should be the same on each cycle

I’m not sure I believe that. Maybe I don’t understand what you mean.

  1. With respect to comparing addresses between threads, the address of each “stack” in this case is a per-thread local address. I'm not sure there is any defined relation between the local address in one thread/stack and the local address in another thread/stack.

  2. With respect to a single thread, the local address of adjacent items in the stack should be adjacent. Otherwise all sorts of C/C++ pointer concepts break down.

In the case of a “stack size” of 256, that implies 256 * 16 * 1024 bytes needed in total for a 1024-thread threadblock, i.e. 4MB, much larger than the available shared memory. This would still only achieve 50% of max occupancy (2048 threads). Allowing the shared memory size to be an “occupancy limiter” would reduce the max thread complement to 48KB/(256*16), i.e. 12 threads. Not sure that would be a good idea. The atomics I am referring to would be executed once per threadblock, to establish the initial stack pointer. It should be of negligible impact. The atomic method I have in mind is similar to what is described here to establish a reusable per-threadblock resource or scratch space:

http://stackoverflow.com/questions/37362554/minimize-the-number-of-curand-states-stored-during-a-mc-simulation/37381193#37381193

  1. You are absolutely right about taking addresses of local vars. My bad

  2. The persistent threads approach here has the same amount of atomic operations - one atomicAdd on a global var in order to take the next job (which is equivalent to the work performed by one thread block in the usual approach). The PT approach may be conceptually simpler, but otherwise there should not be much difference

  3. The shared memory approach seriously limits occupancy, but ensures that memory operations cannot be a bottleneck. It seems that this algorithm is of a pretty usual class - it can be compiled by CUDA as-is, but performance suffers. It may be optimized in many ways:

a) replace single 16-byte Val with separately-managed 8-byte Val and 4-bit Type values, each having its own stack
b) compile the expression to CUDA C++/PTX code and then to GPU machine code on the fly (see the sketch after this list). This code can use registers instead of thread-local/global memory, and GPUs have more registers than cache
c) use registers as indexed variables (via if/shuffle)
d) implement a 2- or 3-level hierarchy including registers, shared and global memory - the top of the stack may be kept in registers and the rest of the stack in shared and then global memory
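
As a rough illustration of (b), a minimal NVRTC sketch (hypothetical kernel and file names) that turns the expression used in this thread into straight-line code and compiles it to PTX at run time:

//nvrtc_sketch.cpp -- build with: g++ nvrtc_sketch.cpp -lnvrtc -o nvrtc_sketch
#include <nvrtc.h>
#include <cstdio>
#include <string>

int main()
{
    //the (L == R) OR (L > R) OR (L < R) expression baked into straight-line code:
    //no stack, no switch, intermediate values stay in registers
    const char *src =
        "extern \"C\" __global__ void eval(const long long *left,\n"
        "                                  const long long *right,\n"
        "                                  int *out, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n)\n"
        "        out[i] = (left[i] == right[i]) || (left[i] > right[i]) || (left[i] < right[i]);\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "eval.cu", 0, NULL, NULL);

    const char *opts[] = { "--gpu-architecture=compute_52" };
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

    size_t log_size;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    printf("compile %s\n%s\n", rc == NVRTC_SUCCESS ? "ok" : "failed", log.c_str());

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    printf("generated %zu bytes of PTX\n", ptx_size);
    //the PTX can then be loaded with cuModuleLoadDataEx / cuModuleGetFunction
    //and launched via cuLaunchKernel (driver API), all at run time

    nvrtcDestroyProgram(&prog);
    return 0;
}

The compilation cost is paid once per expression and amortized over the whole series; the generated kernel needs no per-thread stack at all.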

@txbob: thank you for your detailed explanation! Now I see what happened with my code. This problem took me a lot of time to figure out. I really appreciate your help.

>> Having said all that, this code suggests to me the potential for a lot of thread/divergent behavior.

Can you point out which parts may result in divergence? Since all elements of the two input arrays are evaluated with the same post-fix expression, I think all threads should follow the same execution path.

@BulatZiganshin: Actually, I implemented a kernel similar to your idea of managing the value stack (int64_t) and the type stack (ValType) separately. Here is the code:

__device__ Value evaluate(ExpNode *exp, int exp_size, Value left, Value right, int64_t *val_stack, ValType *type_stack, int offset)
{
	int top = 0;
	ValType l_type, r_type;
	int64_t l_val, r_val;

	for (int i = 0; i < exp_size; i++) {
		switch (exp[i].type) {
		case EXP_VALUE: {
			if (exp[i].idx == 0) {
				val_stack[top] = left.getValue();
				type_stack[top] = left.getValueType();
				top += offset;
			} else {
				val_stack[top] = right.getValue();
				type_stack[top] = right.getValueType();
				top += offset;
			}
			break;
		}
		case EXP_AND: {
			top -= offset;
			r_type = type_stack[top];
			r_val = val_stack[top];
			top -= offset;
			l_type = type_stack[top];
			l_val = val_stack[top];
			val_stack[top] = (int64_t) ((bool)l_val && (bool)r_val);
			type_stack[top] = TYPE_BOOLEAN;
			top += offset;
			break;
		}
		case EXP_OR: {
			top -= offset;
			r_type = type_stack[top];
			r_val = val_stack[top];
			top -= offset;
			l_type = type_stack[top];
			l_val = val_stack[top];
			val_stack[top] = (int64_t) ((bool)l_val || (bool)r_val);
			type_stack[top] = TYPE_BOOLEAN;
			top += offset;
			break;
		}
		case EXP_EQUAL: {
			top -= offset;
			r_type = type_stack[top];
			r_val = val_stack[top];
			top -= offset;
			l_type = type_stack[top];
			l_val = val_stack[top];
			val_stack[top] = (int64_t) (l_val == r_val);
			type_stack[top] = TYPE_BOOLEAN;
			top += offset;
			break;
		}
		case EXP_GREATER: {
			top -= offset;
			r_type = type_stack[top];
			r_val = val_stack[top];
			top -= offset;
			l_type = type_stack[top];
			l_val = val_stack[top];
			val_stack[top] = (int64_t) (l_val > r_val);
			type_stack[top] = TYPE_BOOLEAN;
			top += offset;
			break;
		}
		case EXP_LESS: {
			top -= offset;
			r_type = type_stack[top];
			r_val = val_stack[top];
			top -= offset;
			l_type = type_stack[top];
			l_val = val_stack[top];
			val_stack[top] = (int64_t) (l_val < r_val);
			type_stack[top] = TYPE_BOOLEAN;
			top += offset;
			break;
		}
		default:
			return Value((bool)false, TYPE_BOOLEAN);
		}
	}

	return Value(val_stack[0], type_stack[0]);
}

__global__ void exp_evaluate(ExpNode *exp,
				int exp_size,
				Value *left,
				Value *right,
				int *output,
				int *loop_num,
				int size,
				int64_t *val_stack,
				ValType *type_stack)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int i = 0;
	int offset = blockDim.x * gridDim.x;
	Value result;
	int count = 0;

	while (i < loop_num[x]) {
	    result = evaluate(exp, exp_size, left[x], right[x], val_stack + x, type_stack + x, offset);
	    count += (result.isTrue()) ? 1 : 0;
	    i++;
	}
	//__syncthreads();
	output[x] = count;
}
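
The host-side setup for the two stacks is omitted above; an allocation and launch consistent with offset = blockDim.x * gridDim.x might look like this (sizes assumed from the earlier main.cu):

//assumed host-side setup: one flat allocation per stack, interleaved so that
//element k of thread x lives at index x + k * (total number of threads)
int64_t *cu_val_stack;
ValType *cu_type_stack;
size_t total_threads = (size_t)MAX_ARRAY_SIZE;   //1024 blocks * 1024 threads
checkCudaErrors(cudaMalloc(&cu_val_stack,  total_threads * MAX_STACK_SIZE * sizeof(int64_t)));
checkCudaErrors(cudaMalloc(&cu_type_stack, total_threads * MAX_STACK_SIZE * sizeof(ValType)));

exp_evaluate<<<grid_size, block_size>>>(cu_expression, 11, cu_left, cu_right,
                                        cu_output, cu_loop_num, MAX_ARRAY_SIZE,
                                        cu_val_stack, cu_type_stack);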

The execution time in this case is only 75507 usec on my GTX 970, which is much faster than maintaining a single Value stack!
I used nvprof to trace the execution. Here is what I got:

// Separated stacks: value stack (int64_t) and type stack (ValType)
1 local_load_transactions   Local Load Transactions           118295280   118295280   118295280
1 local_store_transactions  Local Store Transactions           73698158    73698158    73698158
1 gld_transactions          Global Load Transactions           55089621    55089621    55089621
1 gst_transactions          Global Store Transactions          26850083    26850083    26850083
1 dram_read_transactions    Device Memory Read Transactions    56009562    56009562    56009562
1 dram_write_transactions   Device Memory Write Transactions   58049215    58049215    58049215
1 global_hit_rate           Global Hit Rate                      11.84%      11.84%      11.84%
1 local_hit_rate            Local Hit Rate                       35.73%      35.73%      35.73%
// Single stack: only Value stack (class Value)
1 local_load_transactions   Local Load Transactions           396190296   396190296   396190296
1 local_store_transactions  Local Store Transactions          213360576   213360576   213360576
1 gld_transactions          Global Load Transactions           60195510    60195510    60195510
1 gst_transactions          Global Store Transactions         143029112   143029112   143029112
1 dram_read_transactions    Device Memory Read Transactions   134203317   134203317   134203317
1 dram_write_transactions   Device Memory Write Transactions  146623838   146623838   146623838
1 global_hit_rate           Global Hit Rate                       4.69%       4.69%       4.69%
1 local_hit_rate            Local Hit Rate                       35.73%      35.73%      35.73%

How could the memory transactions be so different? I thought they would be about the same, because both kernels follow a similar execution flow.
Does anyone have an idea why there is such a big difference between the two approaches?

If the stack pointers in threads in the same warp get out-of-sync with each other (i.e. different threads in the warp are accessing different relative members in their respective stacks) then the desired coalescing behavior - whether using local memory or my modified interleaved global memory method - will be broken. I had assumed that due to the processing function, the threads may get out of sync with each other, but if your design guarantees that won’t happen then great. One possible way for the threads to get out of sync with each other would be divergence.
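
Concretely, with the interleaved layout from main2.cu above (a small illustrative helper):

//thread x touches stack_pool[x + depth * MAX_ARRAY_SIZE], so a warp coalesces
//only while all 32 lanes share the same depth; a lane sitting one level deeper
//is MAX_ARRAY_SIZE * sizeof(Value) = 16MB away from its neighbors, i.e. a
//separate memory transaction
__device__ Value load_stack_entry(const Value *stack_pool, int x, int depth)
{
	return stack_pool[(size_t)x + (size_t)depth * MAX_ARRAY_SIZE];
}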

Is this an academic exercise or are you trying to solve a real-world problem?

If the latter, have you looked into using runtime compilation instead?