nvcc/ptxas under-utilizing registers for arrays

Hi all,

I’m writing a simple kernel, but I’m having huge trouble getting nvcc/ptxas to use registers for auto arrays (on the stack). Here’s my kernel, for which ptxas reports

24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 24 bytes cumulative stack size, 336 bytes cmem[0]

If I enlarge the array size SSS, only the stack size changes, in jumps of 8 or 16 bytes.

What is going on???

Thanks for any input!

template <int unused>
__global__ void sumarray(int numElements, int *command, int command_length, float *out)
{
#define SSS 4
	volatile float data[SSS];
	int start = numElements/gridDim.x*blockIdx.x + threadIdx.x;
	int end = numElements/gridDim.x*(blockIdx.x+1);

#pragma unroll
	for (int j = 0; j < SSS; j++) {
		// dev_data: a __device__ array defined elsewhere in the file
		data[j] = dev_data[command_length][start + j*blockDim.x + j];
	}
	float sum = 0;
#pragma unroll
	for (int j = 0; j < SSS; j++)
		sum += data[j];

	out[start] = sum;
}

The volatile qualifier defeats the effect you are looking for: it forces every access to the array to go through memory. Since the (“stack”) array in question is thread-local, it’s not clear to me that you need volatile or why you would use it.

The following complete example scales its register usage up as you increase the size of SSS, whereas the stack size does not change, at least for small values of SSS:

$ cat t748.cu
#include <stdio.h>

template <typename T>
__global__ void kernel(T *data){

  T sdata[SSS];
#pragma unroll
  for (int i = 0; i < SSS; i++)
    sdata[i] = data[i];

  T sum = 0.0f;
#ifdef WITH_PRINTF
#pragma unroll
  for (int i = 0; i < SSS; i++)
    printf("sdata[%d] = %f\n", i, sdata[i]);
#endif
#pragma unroll
  for (int i = 0; i < SSS; i++)
    sum += sdata[i];

  printf("sum = %f\n", sum);
}

int main(){

  float *data;
  cudaMalloc(&data, SSS*sizeof(float));
  cudaMemset(data, 0, SSS*sizeof(float));
  kernel<<<1,1>>>(data);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -DWITH_PRINTF -DSSS=4 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 26 bytes gmem, 16 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 40 bytes cmem[0]
$ nvcc -DWITH_PRINTF -DSSS=8 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 26 bytes gmem, 16 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 40 bytes cmem[0]
$ nvcc -DWITH_PRINTF -DSSS=16 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 26 bytes gmem, 16 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 40 bytes cmem[0]

Note that the compiler is pretty smart. If I omit the printf statement in the middle, the compiler does considerable optimization and the register usage is essentially unchanged for small values of SSS:

$ nvcc -DSSS=4 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 10 bytes gmem, 8 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 40 bytes cmem[0]
$ nvcc -DSSS=8 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 10 bytes gmem, 8 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 40 bytes cmem[0]
$ nvcc -DSSS=16 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info    : 10 bytes gmem, 8 bytes cmem[14]
ptxas info    : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info    : Function properties for _Z6kernelIfEvPT_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 40 bytes cmem[0]

If I add volatile before the sdata variable definition above, then we don’t observe this behavior, and the stack usage is much larger.

(The compiler can only promote a stack array to registers when every index into it can be resolved at compile time, since registers can’t be indexed like memory. But your code already seems to take that into account.)
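To make the indexing point concrete, here is a minimal sketch (hypothetical kernel names, 4-element array assumed); compiling each with -Xptxas -v should show a stack frame only for the runtime-indexed version:

```cuda
__global__ void constant_index(const float *in, float *out)
{
    float a[4];
#pragma unroll
    for (int i = 0; i < 4; i++)   // fully unrolled: every index is a
        a[i] = in[i];             // compile-time constant, so a[] can be
                                  // promoted to registers
    out[0] = a[0] + a[1] + a[2] + a[3];
}

__global__ void runtime_index(const float *in, float *out, int idx)
{
    float a[4];
#pragma unroll
    for (int i = 0; i < 4; i++)
        a[i] = in[i];
    out[0] = a[idx];              // idx is not known at compile time, so
                                  // a[] must be addressable and typically
                                  // stays on the local-memory stack
}
```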

Thanks for your reply, txbob!

The volatile keyword was there by mistake, part of experimenting…

In your printf-free example, where are the array elements stored? Local memory, right? If so, the compiler does not seem so smart…

I’m doing this register experiment because I’d like to minimize shared memory/L1 access, since it’s slower than register access.

In the printf-free example, the compiler has optimized away the actual sdata storage and done a global optimization on the kernel, involving register reuse. If you want more details, you should learn how to use the CUDA binary utilities:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

and dump the SASS:

cuobjdump -sass t748

and study it, in whichever cases you wish to compare.
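For example (assuming the t748 binary built above; the -fun option, if your cuobjdump version supports it, restricts the dump to one kernel by its mangled name, as reported by ptxas):

```shell
# disassemble all device code in the binary
cuobjdump -sass t748

# restrict the dump to a single kernel
cuobjdump -sass -fun _Z6kernelIfEvPT_ t748
```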

txbob,
But where did the compiler hide the 16-entry array, given that only 9 registers were used?
Also, the point of the first loop is to hide global memory latency by issuing all the loads up front.
Looks like the compiler’s optimization hurt the performance of that specific kernel… too bad.

It’s mysterious!!

Fortunately, all the tools you need to unravel this “mystery” are right here:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

Thanks for your help txbob!
Can anyone from NVIDIA explain why registers aren’t being used for small arrays?
Seems it happens on Linux as well, as txbob showed. I’m using Windows 7, CUDA 7.0, compiling for sm_30 only.
Thanks,

Registers are not indexable like memory. Therefore, the first prerequisite for allocating a thread-local array in registers is that all indexing can be resolved at compile time, i.e. all indexes are compile-time constants. Optimizations like unrolling of loops and inlining of functions may be necessary to discover that indexes are in fact compile-time computable. Second, registers are generally a precious resource, and the CUDA compiler will allocate thread-local arrays to registers only if they are small. What is considered “small” is governed by a compiler-internal heuristic and is likely architecture dependent.

If the code is transformed to allocate a thread-local array to registers, the compiler will eliminate all array entries that are not used in the code, so the total number of registers used can be less than the number of entries in the array. The effect can be reinforced if different array elements can be allocated to the same register at different times in the code; really, the array elements are treated as a collection of scalar data in such a case.

Very well.

Let’s consider the short case of just 4 elements in our “local” array, which we are loading from global memory, i.e. exactly the case in my code.

The compiler could generate something like this:

// #pragma unroll
//   for (int i = 0; i < SSS; i++)
//     sdata[i] = data[i];

LD R0, data[0]
LD R1, data[1]
LD R2, data[2]
LD R3, data[3]

...

// #pragma unroll
//   for (int i = 0; i < SSS; i++)
//     sum += sdata[i];

MOV R5, 0
ADD R5, R0
ADD R5, R1
ADD R5, R2
ADD R5, R3

However, the compiler is pretty smart. Assuming nothing of importance happens at the point represented by …, the compiler can observe that there’s no particular need to use 4 registers to store the “local” array. Instead, it can replace all of the above with this:

LD  R5, data[0]
LD  R0, data[1]
ADD R5, R0
LD  R0, data[2]
ADD R5, R0
LD  R0, data[3]
ADD R5, R0

and the end result, in R5, is identical. The above optimization could be desirable for several reasons, one of which is reduction of register pressure, another of which is interspersing of LD/ST operations with arithmetic operations. Hopefully you can see that the above “optimization” by the compiler could transform the usage of any size array to just a constant set of registers, which is exactly what we see in the case where I don’t have printf.

Throwing the printf in the middle (where I have depicted … above) ruins the above optimization, and forces the compiler to actually allocate registers equal to the size of the “local” array, which, again, is exactly what we observe.

Sorry txbob, I didn’t realize you work for NVIDIA.
Mathematically you are right: the result is the same for both binaries.
However, performance-wise, the version that does not keep the array in registers is slower.

I would think the real-life relevance is whether this makes a noticeable difference to the performance of your application. Put differently, “Don’t sweat the small stuff”.

Generally speaking, compilers implement a complex set of transformations, often driven by heuristics. As a result, performance of the generated code can be good or excellent for a wide variety of codes and coding styles, but there will inevitably be cases where the performance of the generated code is poor. Beneficial transformations can be added, heuristics can be tuned, but often this comes at the expense of compile time.

Where adding transformations or tuning heuristics can be shown to be beneficial to the performance of a largish subset of codes, or results in significant performance improvements (say, greater than 5%) to important applications (e.g. weather forecasting, synthetic aperture radar, facial recognition, medical imaging), filing an RFE (request for enhancement) with NVIDIA would be an appropriate course of action. RFEs can be filed through the bug reporting form linked from the CUDA registered developer page.

I suppose that depends on a number of factors, many of which haven’t been discussed. Neither of us has actually listed performance numbers or provided a framework and example indicating how they were measured.

Most of what I’ve been trying to respond to in this thread was what I perceived to be questions about compiler mechanics. Your questions started out seeming to me to be “why isn’t the compiler doing this?” or “how could the compiler do that?”

Finally found out it only happens when compiling in debug (Visual Studio 2013 Community here).
In release mode it indeed uses registers.

But in release mode Nsight does not allow me to debug the kernel.
Any way to debug a release kernel, or to make a debug kernel use registers?

Thanks,

CUDA debug builds are compiled with all optimizations disabled. Keeping variables “cached” in a register is an optimization as well; by default, thread-local data “lives” in local memory. You would not want to use debug builds for any form of performance assessment.
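As a sketch of the trade-off (flags only, reusing the t748.cu example from above): -G builds the device code without optimizations, while -lineinfo keeps source-line correlation in an optimized build, which is usually enough for profiling and source-level views in the tools:

```shell
# debug build: device optimizations off, local arrays stay on the stack
nvcc -G -arch=sm_30 -Xptxas -v t748.cu -o t748_debug

# optimized build that still carries source-line info for the tools
nvcc -O3 -lineinfo -arch=sm_30 -Xptxas -v t748.cu -o t748_release
```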

I pretty much never use debug builds for debugging, instead I just insert printf() calls into my regular release builds. Since printf() is a function with side-effects, this can also have an impact on code generation, but it is often minor. Since I have used to that method of debugging ever since I worked in an embedded environment many years ago (where those printfs would write out data through a serial port!) I am usually no slower debugging issues than other engineers armed with the latest fancy debuggers.