I’ve been trying to optimize some CUDA code and I’ve noticed that nvcc (v2.1) seems to allocate a new register for each new temporary calculation rather than reusing previously-allocated ones. The issue seems to propagate all the way to the final binary, judging by the register counts reported in the CUDA profiler.
Here’s some simplified code that illustrates the issue:
__global__
void foo(float *o) {
    __shared__ float a[16];
    int const x = floorf(0.5f); // x == 0
    int const y = ceilf(0.5f);  // y == 1
    a[threadIdx.x+x] = threadIdx.y; // threadIdx.x+x --> %rd7
    a[threadIdx.x+y] = threadIdx.z; // threadIdx.x+y --> %rd10
    o[threadIdx.x] = a[threadIdx.x];
}
If I compile it with “nvcc -c t.cu -keep -O3 -o t.o” and then examine the PTX output, the two commented lines from above become:
cvt.u64.u32 %rd5, %r7; //
mul.lo.u64 %rd6, %rd5, 4; //
add.u64 %rd7, %rd1, %rd6; //
st.shared.f32 [%rd7+0], %f4; // id:26 __cuda_a8+0x0
.loc 14 8 0
cvt.u64.u32 %rd8, %r1; //
mul.lo.u64 %rd9, %rd8, 4; //
add.u64 %rd10, %rd9, %rd1; //
ld.shared.f32 %f7, [%rd10+0]; // id:27 __cuda_a8+0x0
%rd7 holds the address for the first shared memory store. Since %rd7 is dead after that store, I would have expected the compiler to re-use it to hold the address for the second shared memory access, but instead it allocates a new register, %rd10, for that task.
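To make the expectation concrete, here is roughly what I was hoping to see (hand-edited PTX, not actual compiler output, reusing the now-dead %rd5/%rd6/%rd7 instead of allocating %rd8/%rd9/%rd10):
cvt.u64.u32 %rd5, %r1;
mul.lo.u64 %rd6, %rd5, 4;
add.u64 %rd7, %rd6, %rd1; // %rd7 reused for the second address
ld.shared.f32 %f7, [%rd7+0];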
The only workaround I’ve found so far is to set --maxrregcount manually, which is a blunt instrument: it caps register usage for every kernel in the compilation unit and can force spilling to local memory.
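For reference, that looks something like this (the cap of 10 is just an arbitrary example value, not a recommendation):
nvcc -c t.cu -keep -O3 --maxrregcount=10 -o t.o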
Is there any way to get nvcc to consolidate registers automatically, so that it doesn’t allocate a new register when an existing one is dead (i.e. will never be referenced again) and could be reused?