Getting nvcc to consolidate registers
I've been trying to optimize some CUDA code and I've noticed that nvcc (v2.1) seems to allocate new registers for new temporary calculations rather than reusing previously-allocated ones. This issue seems to propagate all the way to the final binary code based on what I see in the CUDA profiler register allocation output.

Here's some simplified code that illustrates the issue:

[code]__global__
void foo(float *o) {
__shared__ float a[16];
int const x = floorf(0.5f);
int const y = ceilf(0.5f);
a[threadIdx.x+x] = threadIdx.y; // threadIdx.x+x --> %rd7
a[threadIdx.x+y] = threadIdx.z; // threadIdx.x+y --> %rd10
o[threadIdx.x] = a[threadIdx.x];
}[/code]

If I compile it with "nvcc -c t.cu -keep -O3 -o t.o" and then examine the PTX output, the two commented lines from above become:
[code]cvt.u64.u32   %rd5, %r7;
mul.lo.u64    %rd6, %rd5, 4;
add.u64       %rd7, %rd1, %rd6;
st.shared.f32 [%rd7+0], %f4;   // id:26 __cuda_a8+0x0
.loc 14 8 0
cvt.u64.u32   %rd8, %r1;
mul.lo.u64    %rd9, %rd8, 4;
add.u64       %rd10, %rd9, %rd1;
ld.shared.f32 %f7, [%rd10+0];  // id:27 __cuda_a8+0x0[/code]

%rd7 holds the index for the first shared memory store. I would have expected the compiler to re-use %rd7 to hold the index for the second shared memory store, but instead it allocates a new register, %rd10, for that task.

The only solution that I've found so far is to set --maxrregcount manually for the whole compilation unit, which has some obvious downsides.
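For reference, that looks like this (the limit of 16 is just an arbitrary example value):

[code]nvcc -c t.cu -keep -O3 --maxrregcount=16 -o t.o[/code]

The cap applies to every kernel in the compilation unit, so kernels that legitimately need more registers get forced to spill into local memory.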

Is there any way to get nvcc to automatically consolidate registers so that new ones are not allocated when there are already existing ones that can be reused (because they will never be referenced again)?

#1
Posted 02/14/2009 06:30 PM   
PTX is an intermediate language, not the final assembly output. Use decuda to verify your assumption.

Consensus here, so far, has been that register reuse is done in the final stage of translating the PTX code to native machine instructions.

However, I have often been able to reduce register usage at the PTX level by carefully making selected local variables "volatile" - it affects compiler optimization such that the compiler puts the value into a register immediately. I even do this for constants (e.g. 1.0 or 0.0) that are needed more than once. This saves registers because otherwise the same constant keeps getting loaded into a fresh register over and over, even if it has already been loaded. The volatile trick is a nice workaround - however, I have only tested it with the 1.1 and 2.0 SDKs so far.
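A minimal sketch of what I mean (hypothetical kernel, just to show the shape of it):

[code]__global__ void scale(float *out, const float *in)
{
    // without volatile the compiler tends to rematerialize 2.0f into a
    // fresh PTX register at each use site; with it, the constant is
    // materialized once and the register is reused
    volatile float two = 2.0f;
    float a = in[threadIdx.x] * two;
    float b = in[threadIdx.x + blockDim.x] * two;
    out[threadIdx.x] = a + b;
}[/code]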

Christian

#2
Posted 02/14/2009 08:38 PM   
[quote name='dalleyg' post='505705' date='Feb 14 2009, 12:30 PM']I've been trying to optimize some CUDA code and I've noticed that nvcc (v2.1) seems to allocate new registers for new temporary calculations rather than reusing previously-allocated ones. This issue seems to propagate all the way to the final binary code based on what I see in the CUDA profiler register allocation output.[/quote]

nvcc outputs PTX in static single assignment (SSA) form (see Wikipedia), a common intermediate representation that simplifies optimization and register allocation (both of which are done in ptxas).
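A rough illustration (the PTX register names here are hypothetical):

[code]__global__ void ssa_demo(int *out)
{
    int i = threadIdx.x;  // nvcc: %r1
    i = i + 1;            // nvcc: %r2 = %r1 + 1 (fresh register, SSA)
    i = i * 2;            // nvcc: %r3 = %r2 * 2 (fresh register again)
    out[threadIdx.x] = i;
    // ptxas then performs real register allocation and can fold
    // %r1..%r3 onto a single hardware register.
}[/code]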

#3
Posted 02/15/2009 01:01 AM   
Thanks for the extra info and suggestions, cbuchner1 and seibert.

#4
Posted 02/15/2009 04:31 AM   
I have been trying to optimize my code as well by reducing my register overhead. Though I have not examined the assembly, studying the ptxas info suggests that registers are not reused and new ones are claimed instead. This is most obvious when I call a function twice back to back: the second call increases the register count, even though any temporary variables needed by the function should have been acquired and released by the first call.

#5
Posted 02/16/2009 07:00 PM   
[quote name='JonWaite' post='506595' date='Feb 16 2009, 08:00 PM']I have been trying to optimize my code as well by reducing my register overhead. Though I have not examined the assembly, by studying the ptxas info, it appears that all the registers are not reused and new ones are claimed. This is most obvious when I call a function twice back to back, the second call increases the number of registers even though any temporary variables needed by the function should have been acquired and released by the first call.[/quote]

If this is not in the FAQ, it should be. PTX allocates a new register for every new variable it encounters. In the compilation phase from PTX to device code, ptxas optimizes this.
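A sketch of why back-to-back calls inflate the PTX register count (hypothetical example):

[code]__device__ float sq(float v) { return v * v; }

__global__ void twice(float *o, float a, float b)
{
    // nvcc force-inlines __device__ functions, so each call site
    // re-emits the body with fresh PTX registers:
    float r1 = sq(a);  // inlined copy 1
    float r2 = sq(b);  // inlined copy 2, new PTX registers
    o[threadIdx.x] = r1 + r2;
    // ptxas reuses hardware registers afterwards, so check the cubin,
    // not the PTX, for the real count.
}[/code]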

greets,
Denis

#6
Posted 02/17/2009 05:55 AM   
[quote name='E.D. Riedijk' post='506777' date='Feb 17 2009, 03:55 PM']If this is not in the FAQ, it should be. ptx allocates a new register for every new variable it encounters. In the compilation phase from ptx to device code ptxas optimizes this.[/quote]

To clarify, if you want to see actual register usage of your code, you should be looking at the .cubin file, not the .ptx file. The cubin is the final binary that gets loaded into the GPU, and also contains the final register usage count, as well as shared mem and local mem usage.
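If you just want the numbers, you can also make ptxas print them at compile time instead of opening the cubin; the output line below is illustrative (quoted from memory), so your counts will differ:

[code]nvcc -c t.cu --ptxas-options=-v
# prints something like: ptxas info: Used 5 registers, 64+16 bytes smem[/code]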

#7
Posted 02/17/2009 10:40 PM   
[quote name='cbuchner1' post='505739' date='Feb 14 2009, 09:38 PM']PTX is an intermediate language, not the final assembly output. Use decuda to verify your assumption.

Consensus here, so far, has been that register reuse is done in the final stage of translating the PTX code to native machine instructions.

However, I have often been able to reduce register usage at the PTX level by carefully making selected local variables "volatile" - it affects compiler optimization such that the compiler puts the value into a register immediately. I even do this for constants (e.g. 1.0 or 0.0) that are needed more than once. This saves registers because otherwise the same constant keeps getting loaded into a fresh register over and over, even if it has already been loaded. The volatile trick is a nice workaround - however, I have only tested it with the 1.1 and 2.0 SDKs so far.

Christian[/quote]

Brilliant idea!
Using this trick I decreased register usage from 36 to 29, so with 128 threads per block I can now run 4 blocks per SM. Thanks!
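For anyone checking the math: this is consistent with a compute 1.2/1.3 card (16384 registers per SM - I'm ignoring register allocation granularity here):

[code]128 threads/block * 36 regs = 4608 regs/block -> floor(16384 / 4608) = 3 blocks/SM
128 threads/block * 29 regs = 3712 regs/block -> floor(16384 / 3712) = 4 blocks/SM[/code]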

#8
Posted 02/20/2009 08:58 AM   
Could you guys give an example of how you decrease register usage with the "volatile" keyword? I gave it a try, but the only register usage reduction I achieved came at the price of local memory usage.

#9
Posted 03/04/2009 09:19 AM   
[quote name='Radko' post='513451' date='Mar 4 2009, 10:19 AM']Could you guys give an example how you decrease the register usage with "volatile" keyword? I gave it a try, but the only register usage reduction I achieved resulted in usage of local memory.[/quote]

It works best for intermediate variables, loop counters, indices, etc. Make these volatile. Examples follow.

Don't make arrays volatile.


[code]// bogus array access example

// unoptimized version
// often the compiler "inlines" the computation needlessly, leading to longer PTX and extra register use.
int tmp = blockIdx.x + 7;
g_array1[tmp] = x; // inlined as g_array1[blockIdx.x + 7] = x;
g_array2[tmp] = y; // inlined as g_array2[blockIdx.x + 7] = y;

// optimized version: tmp is forced into a register and reused in both array accesses
// look at the PTX to see why this is better.
volatile int tmp = blockIdx.x + 7;
g_array1[tmp] = x;
g_array2[tmp] = y;


// bogus computation example

// unoptimized version of some bogus computation
x = cos ( arg - 1.0f );
y = sin ( arg - 1.0f );

// optimized version: the advantage becomes apparent when you look at the PTX
volatile float one = 1.0f; // declare such constants early on, re-use often (where applicable)
x = cos ( arg - one );
y = sin ( arg - one );


// unoptimized initialization example:
// PTX may allocate up to 3 extra registers (first load the constant into a register, then assign it to the target)
var1 = 0.0f;
var2 = 0.0f;
var3 = 0.0f;

// optimized initialization example
// better register use: PTX allocates 1 extra register for the constant 0
volatile float zero = 0.0f;
var1 = zero;
var2 = zero;
var3 = zero;[/code]

The PTX allocates fewer registers, and the final register-use optimization that happens during translation from PTX to machine code may add some extra efficiency. Without the volatile trick, the PTX allocates more registers initially, and the "peephole" optimizer (or whatever algorithm is used) does not achieve the same efficiency as when we help it manually.

#10
Posted 03/04/2009 10:02 AM   
Thanks a lot. I tried it, and it worked wonderfully.

#11
Posted 03/04/2009 10:42 AM   
[quote name='Radko' post='513475' date='Mar 4 2009, 11:42 AM']Thanks a lot. I tried it, and it worked wonderfully.[/quote]

Summarizing all of the above examples in one sentence:

For variables with local scope (that are not forced into local memory, as arrays would be), the volatile keyword is essentially the missing "__register" keyword.

#12
Posted 03/04/2009 11:31 AM   
[quote name='cbuchner1' post='513487' date='Mar 4 2009, 06:31 AM']Summarizing all of the above examples in one sentence:

For variables with local scope (that are not forced to local memory like arrays would) the volatile keyword is essentially
the missing "__register" keyword.[/quote]

This thread was tremendously useful; it works great!

Could you give me an official reference to include in my work? I couldn't find anything helpful about the volatile keyword in the programming guide; the only thing I found was:

"Only after the execution of a __syncthreads() (Section 4.4.2) are writes to shared variables guaranteed to be visible by other threads. Unless the variable is declared as volatile, the compiler is free to optimize the reads and writes to shared memory as long as the previous statement is met."

and in the ptx_isa_1.2 document there is a reference to this keyword, and I quote:

"st.volatile may be used with .global and .shared spaces to inhibit optimization of
references to volatile memory. This may be used, for example, to enforce sequential
consistency between threads accessing shared memory."

But neither of these really clears things up for me.
Thanks a lot!

#13
Posted 05/20/2009 09:13 PM   
[quote name='Lermy' post='543533' date='May 20 2009, 11:13 PM']This thread was tremendously useful, it works great!

Could you give me an official reference to include in my work? I couldn't find anything helpful about the volatile keyword in the programming guide; the only thing I found was:[/quote]

Sorry, there is no official reference on using the "volatile" keyword to reduce register pressure. All I know about it comes from experience (looking at the PTX output before and after inserting volatile). So far it has worked in all SDK versions, and it will probably continue to work until nVidia significantly improves their compiler.
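One thing worth keeping apart: the passages you quoted describe the documented, shared-memory meaning of volatile (inter-thread visibility), which is a different effect from the register trick in this thread. A minimal sketch of that documented use, the kind of warp-synchronous code where it matters:

[code]// volatile forces each access to actually go through shared memory,
// so threads in the same warp see one another's updates without
// an intervening __syncthreads()
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}[/code]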

#14
Posted 07/01/2009 09:53 AM   
[quote name='cbuchner1' post='560146' date='Jul 1 2009, 05:53 AM']Sorry, there is no official reference on use of "volatile" keyword for reducing register pressure. All I know about this is from experience (looking at PTX output before and after inserting volatile). So far it has worked in all SDK versions, and it probably will continue to work until nVidia significantly improve their compiler.[/quote]

Thanks so much for sharing this trick; it really helps. It dropped my troublesome kernel from 20 to 15 registers, at the cost of quite a bit of speed for some reason.

Looking at the PTX code (I can't get decuda to work with G10), I am amazed at how many cvt and mov instructions are wasting registers. I've tried using unsigned shorts to prevent the cvt instructions, but this just increases the %rh counts (which I assume are half-registers), and I've switched from unsigned to signed ints to try to prevent mov instructions that only move an unsigned int into a signed int register. I know that optimization takes place after PTX, but without being able to look at my cubin with decuda, PTX is all I have at the moment.

I guess I could work on the decuda code to get it working, but I'm not a Python programmer, so that would take a while. Any tricks you've learned to cut down on some of the cvt and mov instructions?

- Richard

#15
Posted 07/01/2009 12:59 PM   