The volatile keyword seems to disable the effect you are looking for. Since the (“stack”) array in question is thread-local, it’s not clear to me that you need volatile or why you would use it.
The following complete example scales its register usage up as you increase the size of SSS, whereas the stack size does not change, at least for small values of SSS:
$ cat t748.cu
#include <stdio.h>
template <typename T>
__global__ void kernel(T *data){
T sdata[SSS];
#pragma unroll
for (int i = 0; i < SSS; i++)
sdata[i] = data[i];
T sum = 0.0f;
#ifdef WITH_PRINTF
#pragma unroll
for (int i = 0; i < SSS; i++)
printf("sdata[%d] = %f\n", i, sdata[i]);
#endif
#pragma unroll
for (int i = 0; i < SSS; i++)
sum += sdata[i];
printf("sum = %f\n", sum);
}
int main(){
float *data;
cudaMalloc(&data, SSS*sizeof(float));
cudaMemset(data, 0, SSS*sizeof(float));
kernel<<<1,1>>>(data);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -DWITH_PRINTF -DSSS=4 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 26 bytes gmem, 16 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 40 bytes cmem[0]
$ nvcc -DWITH_PRINTF -DSSS=8 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 26 bytes gmem, 16 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 40 bytes cmem[0]
$ nvcc -DWITH_PRINTF -DSSS=16 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 26 bytes gmem, 16 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 40 bytes cmem[0]
Note that the compiler is pretty smart. If I omit the printf statement in the middle, the compiler does considerable optimization and the register usage is essentially unchanged for small values of SSS:
$ nvcc -DSSS=4 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 10 bytes gmem, 8 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 10 registers, 40 bytes cmem[0]
$ nvcc -DSSS=8 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 10 bytes gmem, 8 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 40 bytes cmem[0]
$ nvcc -DSSS=16 -arch=sm_20 -O3 -Xptxas -v t748.cu -o t748
ptxas info : 10 bytes gmem, 8 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelIfEvPT_' for 'sm_20'
ptxas info : Function properties for _Z6kernelIfEvPT_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 40 bytes cmem[0]
If I add volatile before the sdata variable definition above, then we don’t observe this behavior, and the stack usage is much larger.
(The compiler can only convert stack arrays to registers when it can determine the necessary register usage at compile time, since registers can’t be indexed into like memory. But your code already seems to take that into account.)