Register usage
Hello all,


I wanted to increase the register usage for a kernel. So I set the compilation flag --maxrregcount=16 for the code below,


[code]
__global__ void kernel(float* xIN,float* yIN,float* zIN, float* tmIN, int NumVertices,
float* xOUT,float* yOUT,float* zOUT )
{

int i = threadIdx.x + blockIdx.x*blockDim.x; // get thread ID

float x;

float y;

float z;

if(i<NumVertices){
x = xIN[i];

y = yIN[i];

z = zIN[i];


xOUT[i] = tmIN[0] * x + tmIN[4] * y + tmIN[8] * z + tmIN[12];

yOUT[i] = tmIN[1] * x + tmIN[5] * y + tmIN[9] * z + tmIN[13];

zOUT[i] = tmIN[2] * x + tmIN[6] * y + tmIN[10] * z + tmIN[14];
}

}[/code]



The --ptxas-options=-v option shows,

ptxas info : Compiling entry function '_Z9transformPfS_S_S_iS_S_S_' for 'sm_13'
ptxas info : Used 8 registers, 64+16 bytes smem



But when I set --maxrregcount=32 and also declare some extra variables in the kernel like below, still the register usage is same?

[code]__global__ void transform(float* xIN,float* yIN,float* zIN, float* tmIN, int NumVertices,
float* xOUT,float* yOUT,float* zOUT )
{

int i = threadIdx.x + blockIdx.x*blockDim.x; // get thread ID

float x;

float y;

float z;

register float a,b,c,d,f,r,w,q,s,h,n;
a=b=c=d=f=r=w=q=s=h=n=4;
a=b+c+d+f+r+w+q+s+h+n;
a=b+a;

if(i<NumVertices){
x = xIN[i];

y = yIN[i];

z = zIN[i];


xOUT[i] = tmIN[0] * x + tmIN[4] * y + tmIN[8] * z + tmIN[12];

yOUT[i] = tmIN[1] * x + tmIN[5] * y + tmIN[9] * z + tmIN[13];

zOUT[i] = tmIN[2] * x + tmIN[6] * y + tmIN[10] * z + tmIN[14];
}

}
[/code]

ptxas info : Compiling entry function '_Z9transformPfS_S_S_iS_S_S_' for 'sm_13'
ptxas info : Used 8 registers, 64+16 bytes smem

note: arch=13


Thanks for any help in this regards.
Hello all,





I wanted to increase the register usage for a kernel. So I set the compilation flag --maxrregcount=16 for the code below,







__global__ void kernel(float* xIN,float* yIN,float* zIN, float* tmIN, int NumVertices,

float* xOUT,float* yOUT,float* zOUT )

{



int i = threadIdx.x + blockIdx.x*blockDim.x; // get thread ID



float x;



float y;



float z;



if(i<NumVertices){

x = xIN[i];



y = yIN[i];



z = zIN[i];





xOUT[i] = tmIN[0] * x + tmIN[4] * y + tmIN[8] * z + tmIN[12];



yOUT[i] = tmIN[1] * x + tmIN[5] * y + tmIN[9] * z + tmIN[13];



zOUT[i] = tmIN[2] * x + tmIN[6] * y + tmIN[10] * z + tmIN[14];

}



}








The --ptxas-options=-v option shows,



ptxas info : Compiling entry function '_Z9transformPfS_S_S_iS_S_S_' for 'sm_13'

ptxas info : Used 8 registers, 64+16 bytes smem







But when I set --maxrregcount=32 and also declare some extra variables in the kernel like below, still the register usage is same?



__global__ void transform(float* xIN,float* yIN,float* zIN, float* tmIN, int NumVertices,

float* xOUT,float* yOUT,float* zOUT )

{



int i = threadIdx.x + blockIdx.x*blockDim.x; // get thread ID



float x;



float y;



float z;



register float a,b,c,d,f,r,w,q,s,h,n;

a=b=c=d=f=r=w=q=s=h=n=4;

a=b+c+d+f+r+w+q+s+h+n;

a=b+a;



if(i<NumVertices){

x = xIN[i];



y = yIN[i];



z = zIN[i];





xOUT[i] = tmIN[0] * x + tmIN[4] * y + tmIN[8] * z + tmIN[12];



yOUT[i] = tmIN[1] * x + tmIN[5] * y + tmIN[9] * z + tmIN[13];



zOUT[i] = tmIN[2] * x + tmIN[6] * y + tmIN[10] * z + tmIN[14];

}



}




ptxas info : Compiling entry function '_Z9transformPfS_S_S_iS_S_S_' for 'sm_13'

ptxas info : Used 8 registers, 64+16 bytes smem



note: arch=13





Thanks for any help in this regards.

-<b> programing </b>is fun but it exhausts you ... WHY ?

-- GTX 480, GTX 285 and 9200M

#1
Posted 03/12/2012 10:41 AM   
--maxrregcount does only set an upper limit to the number of registers to use, it doesn't force the compiler to use more registers.

Moreover, in your second example, none of your variables are used for something "useful". This is just dead code that one can expect the compiler to eliminate.
--maxrregcount does only set an upper limit to the number of registers to use, it doesn't force the compiler to use more registers.



Moreover, in your second example, none of your variables are used for something "useful". This is just dead code that one can expect the compiler to eliminate.

#2
Posted 03/12/2012 11:01 AM   
[quote name='Gilles_C' date='12 March 2012 - 04:01 PM' timestamp='1331550119' post='1381643']
--maxrregcount does only set an upper limit to the number of registers to use, it doesn't force the compiler to use more registers.

Moreover, in your second example, none of your variables are used for something "useful". This is just dead code that one can expect the compiler to eliminate.
[/quote]


So any way to force compiler to use registers?
[quote name='Gilles_C' date='12 March 2012 - 04:01 PM' timestamp='1331550119' post='1381643']

--maxrregcount does only set an upper limit to the number of registers to use, it doesn't force the compiler to use more registers.



Moreover, in your second example, none of your variables are used for something "useful". This is just dead code that one can expect the compiler to eliminate.







So any way to force compiler to use registers?

-<b> programing </b>is fun but it exhausts you ... WHY ?

-- GTX 480, GTX 285 and 9200M

#3
Posted 03/13/2012 10:54 AM   
Write code that would benefit from using registers. In your sample case, each thread could loop over multiple values for [font="Courier New"]i[/font], so that [font="Courier New"]tmIN[][/font] could be kept in registers and thus read less often.
Write code that would benefit from using registers. In your sample case, each thread could loop over multiple values for i, so that tmIN[] could be kept in registers and thus read less often.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#4
Posted 03/13/2012 11:09 AM   
It is not clear what you are trying to accomplish by artifically inflating register usage, but a relatively effective way to do that would be to multiply some variable by a math function that evaluates to 1.0f (you might want to try powf(1.0f,1.0f), make sure it actually evalutes to 1.0f in case there are rounding errors). Obviously this approach would also increase dynamic instruction count which may or may not conflict with the purpose of the exercise.
It is not clear what you are trying to accomplish by artifically inflating register usage, but a relatively effective way to do that would be to multiply some variable by a math function that evaluates to 1.0f (you might want to try powf(1.0f,1.0f), make sure it actually evalutes to 1.0f in case there are rounding errors). Obviously this approach would also increase dynamic instruction count which may or may not conflict with the purpose of the exercise.

#5
Posted 03/13/2012 06:13 PM   
Scroll To Top