Interesting.
The metrics I would focus on are:
$ nvcc -arch=sm_52 -Xptxas -v -DMAX_STACK_SIZE=4 -I/usr/local/cuda/samples/common/inc -o test main.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i' for 'sm_52'
ptxas info : Function properties for _Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 26 registers, 372 bytes cmem[0], 32 bytes cmem[2]
$ nvprof --metrics local_store_transactions,local_load_transactions,dram_read_transactions,dram_write_transactions,l2_utilization ./test
==2660== NVPROF is profiling process 2660, command: ./test
==2660== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==2660== Replaying kernel "exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)" (done)
Execution time: 132677 usec
End
==2660== Profiling application: ./test
==2660== Profiling result:
==2660== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 960 (0)"
Kernel: exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)
1 local_store_transactions Local Store Transactions 107656379 107656379 107656379
1 local_load_transactions Local Load Transactions 95126884 95126884 95126884
1 dram_read_transactions Device Memory Read Transactions 1319005 1319005 1319005
1 dram_write_transactions Device Memory Write Transactions 49669178 49669178 49669178
1 l2_utilization L2 Cache Utilization Max (10) Max (10) Max (10)
$ nvcc -arch=sm_52 -Xptxas -v -DMAX_STACK_SIZE=256 -I/usr/local/cuda/samples/common/inc -o test main.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i' for 'sm_52'
ptxas info : Function properties for _Z12exp_evaluateP7ExpNodeiP5ValueS2_PiS3_i
4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 372 bytes cmem[0], 24 bytes cmem[2]
$ nvprof --metrics local_store_transactions,local_load_transactions,dram_read_transactions,dram_write_transactions,l2_utilization ./test
==2718== NVPROF is profiling process 2718, command: ./test
==2718== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==2718== Replaying kernel "exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)" (0 of 2)...
Execution time: 1908386 usec
==2718== Replaying kernel "exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)" (done)
End
==2718== Profiling application: ./test
==2718== Profiling result:
==2718== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 960 (0)"
Kernel: exp_evaluate(ExpNode*, int, Value*, Value*, int*, int*, int)
1 local_store_transactions Local Store Transactions 1915478518 1915478518 1915478518
1 local_load_transactions Local Load Transactions 95062058 95062058 95062058
1 dram_read_transactions Device Memory Read Transactions 34390085 34390085 34390085
1 dram_write_transactions Device Memory Write Transactions 1911826037 1911826037 1911826037
1 l2_utilization L2 Cache Utilization Mid (5) Mid (5) Mid (5)
$
We see that in the MAX_STACK_SIZE = 256 case, the local store transactions increase by roughly 18x, the local load transactions remain about the same, the dram read transactions increase by about 26x, and the dram write transactions increase by almost 40x.
Bucking the trend, the L2 utilization (effectively a measure of average L2 bandwidth) goes down in the 256 stack size case.
So what we have here is a local memory organization that in one case (stack size 4) is more L2-friendly, and in the other case (256) is less L2-friendly. The reason it is less L2-friendly is that it is spending more time in dram latency hell (this is also why the L2 utilization metric oddly goes down).
We can get some clues about local memory organization from the programming guide. First, let’s remind ourselves how local memory is organized:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
“Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs.”
Now, what’s meant by “thread ID” specifically? That refers back to section 2.2:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
“The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same;…”
So we see that organization by thread ID implies a block-level organization, not a grid-level organization.
Suppose I had a grid-level organization, with threadblocks of 4 threads each and a stack size of 3 (numbers are global thread IDs, each row is one stack level, and linear memory runs left to right along each row, then down to the next row):
TB0:         TB1:         TB2:
 0  1  2  3   4  5  6  7   8  9 10 11  ...
 0  1  2  3   4  5  6  7   8  9 10 11  ...
 0  1  2  3   4  5  6  7   8  9 10 11  ...
But instead we have a block-level organization, where each threadblock’s stacks occupy a contiguous region, and linear memory runs through all of TB0’s region before reaching TB1’s:
TB0:
 0  1  2  3
 0  1  2  3
 0  1  2  3
TB1:
 4  5  6  7
 4  5  6  7
 4  5  6  7
TB2:
 8  9 10 11
 8  9 10 11
 8  9 10 11
...
The takeaway is that in the first case, the size of the stack (the vertical dimension) should not matter: elements at the top of each stack are still grouped together in memory, followed by the next element in each stack, throughout the grid. This is true regardless of stack size.
But in the second case this is not true. If we increase the stack (the vertical dimension), elements at the top of the stack in TB0 get farther away from the corresponding elements at the top of the stack in TB1.
This then has implications for the L2 cache, where local loads and stores will attempt to hit before they escape to dram.
The size of your stack element is 16 bytes. So a stack size of 4 means 64 bytes per thread, or 64KB per 1024-thread threadblock.
For an L2 cache of ~1792KB (the GTX 970 figure; the GTX 960 profiled here has a smaller L2, which only strengthens the argument), a stack size of 4 means 1792/64, or about 28 threadblocks (out of 1024 total) worth of stack frames could fit in L2 cache at any time.
If we’re only touching that much of the larger stack, then with a fully associative cache there shouldn’t be much difference. But the L2 cache is not fully associative, and the larger stack introduces a fairly large step into the access pattern. So I think it’s possible that this effectively results in thrashing of the L2 cache (the metrics suggest this to me, anyway), which produces the dramatic increase in the total number of dram transactions in the slower 256 case.
This is still a pretty hand-wavy explanation, and it definitely does not address all possible questions (like why the local store transactions are so much higher while the local loads are not).
However, I think it should be possible to avoid the block-vs-grid organization issue by using global memory for the stack instead of local memory, since you can organize it any way you want. You mention that you tried this, but you may not have organized it correctly. If you simply carve out a contiguous chunk for each thread, that won’t work: you’re back in the vertical-vs-horizontal trap. Instead, carve up the global allocation so that each thread’s stack is interleaved - which is effectively what the local memory organization does - but interleave across the grid instead of across the threadblock.
A problem you will run into here is that for a large stack size, this is a really large allocation: a stack size of 256 with your grid size (1048576 threads) requires ~4GB. We can improve on this by determining the maximum instantaneous occupancy (2 threadblocks * number of GPU SMs for this particular code) and using an atomic technique that lets threadblocks reuse the stack area of threadblocks that have already retired. This reduces the required global allocation to a very manageable 2 * #SMs * threads-per-block * stack size * 16 bytes (i.e. sizeof(Value)).
Having said all that, this code suggests to me the potential for a lot of divergent thread behavior, which may lead each thread to be accessing different relative elements of its stack. Depending on how this divergence works out in your “real” application, and on the actual stack size, none of this fiddling may matter: you may still end up thrashing the L2 with large-scale disorganized accesses.
To demonstrate that global memory could be used to work around this via reorganization of the underlying data storage pattern, here’s a modification of your code that uses a single global memory allocation for the stack, and each thread has a strided/interleaved stack.
//main2.cu
#include <iostream>
#include <stdio.h>
#include <helper_cuda.h>
#include <helper_functions.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
//#define MAX_STACK_SIZE 4
#define nTPB 1024
#define nBLK 1024
#define MAX_ARRAY_SIZE (nTPB * nBLK)
typedef enum {
  EXP_VALUE = 0,
  EXP_AND,
  EXP_OR,
  EXP_EQUAL,
  EXP_GREATER,
  EXP_LESS
} ExpType;
typedef struct {
  ExpType type;
  int idx; //0: left array, 1: right array
} ExpNode;
typedef enum {
  TYPE_INVALID = 0,
  TYPE_NULL,
  TYPE_BOOLEAN,
  TYPE_INTEGER,
  TYPE_DOUBLE
} ValType;
class Value {
public:
  __host__ __device__ Value(): value_(0), type_(TYPE_NULL) {}
  __host__ __device__ Value(int64_t val, ValType val_type): value_(val), type_(val_type) {}
  inline __device__ Value equal(Value &h) {
    int64_t h_val = h.getValue();
    return (value_ == h_val) ? getTrue() : getFalse();
  }
  inline __device__ Value greater_than(Value &h) {
    int64_t h_val = h.getValue();
    return (value_ > h_val) ? getTrue() : getFalse();
  }
  inline __device__ Value less_than(Value &h) {
    int64_t h_val = h.getValue();
    return (value_ < h_val) ? getTrue() : getFalse();
  }
  inline __device__ Value log_and(Value &h) {
    bool h_val = (bool)(h.getValue());
    return ((bool)value_ && h_val) ? getTrue() : getFalse();
  }
  inline __device__ Value log_or(Value &h) {
    bool h_val = (bool)(h.getValue());
    return ((bool)value_ || h_val) ? getTrue() : getFalse();
  }
  static __device__ Value getTrue() {
    return Value((int64_t)true, TYPE_BOOLEAN);
  }
  static __device__ Value getFalse() {
    return Value((int64_t)false, TYPE_BOOLEAN);
  }
  inline __host__ __device__ void setData(int64_t val, ValType val_type) {
    value_ = val;
    type_ = val_type;
  }
  __device__ bool isTrue() {
    return ((bool)value_);
  }
  __device__ bool isFalse() {
    return !isTrue();
  }
  __device__ int64_t getValue() {
    return value_;
  }
  __device__ ValType getValueType() {
    return type_;
  }
private:
  int64_t value_;
  ValType type_;
};
__device__ Value evaluate(ExpNode *exp, int exp_size, Value left, Value right, Value *stack_ptr)
{
  Value ltemp, rtemp, *stack = stack_ptr;
  // stride between successive stack levels of one thread: a full grid's worth
  // of elements, so each thread's stack is interleaved across the whole grid
  int incr = MAX_ARRAY_SIZE;
  for (int i = 0; i < exp_size; i++) {
    switch (exp[i].type) {
      case EXP_VALUE: {
        if (exp[i].idx == 0) {
          *stack_ptr = left;  stack_ptr += incr;
        } else {
          *stack_ptr = right; stack_ptr += incr;
        }
        break;
      }
      case EXP_AND: {
        stack_ptr -= incr; ltemp = *stack_ptr;
        stack_ptr -= incr; rtemp = *stack_ptr;
        *stack_ptr = ltemp.log_and(rtemp); stack_ptr += incr;
        break;
      }
      case EXP_OR: {
        stack_ptr -= incr; ltemp = *stack_ptr;
        stack_ptr -= incr; rtemp = *stack_ptr;
        *stack_ptr = ltemp.log_or(rtemp); stack_ptr += incr;
        break;
      }
      case EXP_EQUAL: {
        stack_ptr -= incr; ltemp = *stack_ptr;
        stack_ptr -= incr; rtemp = *stack_ptr;
        *stack_ptr = ltemp.equal(rtemp); stack_ptr += incr;
        break;
      }
      case EXP_GREATER: {
        stack_ptr -= incr; ltemp = *stack_ptr;
        stack_ptr -= incr; rtemp = *stack_ptr;
        *stack_ptr = ltemp.greater_than(rtemp); stack_ptr += incr;
        break;
      }
      case EXP_LESS: {
        stack_ptr -= incr; ltemp = *stack_ptr;
        stack_ptr -= incr; rtemp = *stack_ptr;
        *stack_ptr = ltemp.less_than(rtemp); stack_ptr += incr;
        break;
      }
      default:
        return Value::getFalse();
    }
  }
  return stack[0];
}
__global__ void exp_evaluate(ExpNode *exp,
                             int exp_size,
                             Value *left,
                             Value *right,
                             int *output,
                             int *loop_num,
                             Value *my_stack)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int i = 0;
  Value result;
  int count = 0;
  while (i < loop_num[x]) {
    // each thread's stack base is offset by its global thread index; successive
    // levels are MAX_ARRAY_SIZE elements apart (grid-level interleave, see evaluate)
    result = evaluate(exp, exp_size, left[x], right[x], my_stack+x);
    count += (result.isTrue()) ? 1 : 0;
    i++;
  }
  //__syncthreads();
  output[x] = count;
}
int main()
{
  Value *left, *right, *sp;
  int *loop_num;
  left = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
  if (left == NULL) {
    printf("Failed to allocate left array. \n");
    return -1;
  }
  right = (Value *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
  if (right == NULL) {
    printf("Failed to allocate right array. \n");
    return -1;
  }
  // sizeof(Value) over-allocates for an int array, but is kept consistent with
  // the device-side allocation and copy sizes below
  loop_num = (int *)malloc(sizeof(Value) * MAX_ARRAY_SIZE);
  if (loop_num == NULL) {
    printf("Failed to allocate loop_num array. \n");
    return -1;
  }
  srand(time(NULL));
  //Manually set values for left and right arrays
  for (int i = 0; i < MAX_ARRAY_SIZE; i++) {
    left[i].setData(1, TYPE_INTEGER);
    right[i].setData(1, TYPE_INTEGER);
    loop_num[i] = rand() % 20 + 1;
  }
  cudaError_t res;
  Value *cu_left, *cu_right;
  int *cu_loop_num, *cu_output, output[MAX_ARRAY_SIZE];
  checkCudaErrors(cudaMalloc(&cu_left, MAX_ARRAY_SIZE * sizeof(Value)));
  checkCudaErrors(cudaMalloc(&cu_right, MAX_ARRAY_SIZE * sizeof(Value)));
  checkCudaErrors(cudaMalloc(&cu_loop_num, MAX_ARRAY_SIZE * sizeof(Value)));
  checkCudaErrors(cudaMalloc(&cu_output, MAX_ARRAY_SIZE * sizeof(int)));
  // single global allocation holding every thread's interleaved stack
  checkCudaErrors(cudaMalloc(&sp, MAX_ARRAY_SIZE * MAX_STACK_SIZE * sizeof(Value)));
  checkCudaErrors(cudaMemcpy(cu_left, left, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(cu_right, right, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(cu_loop_num, loop_num, MAX_ARRAY_SIZE * sizeof(Value), cudaMemcpyHostToDevice));
  ExpNode expression[11];
  ExpNode *cu_expression;
  //Manually create post-fix expression
  expression[0].idx = 0;
  expression[0].type = EXP_VALUE;
  expression[1].idx = 1;
  expression[1].type = EXP_VALUE;
  expression[2].type = EXP_EQUAL;
  expression[3].idx = 0;
  expression[3].type = EXP_VALUE;
  expression[4].idx = 1;
  expression[4].type = EXP_VALUE;
  expression[5].type = EXP_GREATER;
  expression[6].type = EXP_OR;
  expression[7].idx = 0;
  expression[7].type = EXP_VALUE;
  expression[8].idx = 1;
  expression[8].type = EXP_VALUE;
  expression[9].type = EXP_LESS;
  expression[10].type = EXP_OR;
  checkCudaErrors(cudaMalloc(&cu_expression, 11 * sizeof(ExpNode)));
  checkCudaErrors(cudaMemcpy(cu_expression, expression, 11 * sizeof(ExpNode), cudaMemcpyHostToDevice));
  dim3 grid_size(nBLK, 1, 1);
  dim3 block_size(nTPB, 1, 1);
  struct timeval start, end;
  gettimeofday(&start, NULL);
  exp_evaluate<<<grid_size, block_size>>>(cu_expression, 11, cu_left, cu_right, cu_output, cu_loop_num, sp);
  res = cudaGetLastError();
  if (res != cudaSuccess) {
    printf("Error: kernel launch failed. Error code: %s\n", cudaGetErrorString(res));
    return -1;
  }
  checkCudaErrors(cudaDeviceSynchronize());
  gettimeofday(&end, NULL);
  printf("Execution time: %lu usec\n", (end.tv_sec - start.tv_sec) * 1000000 + end.tv_usec - start.tv_usec);
  checkCudaErrors(cudaMemcpy(output, cu_output, MAX_ARRAY_SIZE * sizeof(int), cudaMemcpyDeviceToHost));
  checkCudaErrors(cudaFree(cu_left));
  checkCudaErrors(cudaFree(cu_right));
  checkCudaErrors(cudaFree(cu_output));
  checkCudaErrors(cudaFree(cu_loop_num));
  checkCudaErrors(cudaFree(cu_expression));
  checkCudaErrors(cudaFree(sp));
  free(left);
  free(right);
  free(loop_num);
  printf("End\n");
  return 0;
}
When I compile and run this code, the execution time does not vary between stack sizes of 4 and 64:
$ nvcc -arch=sm_52 -I/usr/local/cuda/samples/common/inc -DMAX_STACK_SIZE=4 main2.cu -o test2
$ ./test2
Execution time: 41047 usec
End
$ nvcc -arch=sm_52 -I/usr/local/cuda/samples/common/inc -DMAX_STACK_SIZE=64 main2.cu -o test2
$ ./test2
Execution time: 41017 usec
End
$