Reduce the number of registers
Hello All,

I wrote a kernel that evaluates an equation. When I run it in the Visual Profiler, I find that I can't run all of the available threads in parallel: the GPU can run 1563 threads in parallel, but I reach only 1024 because of the register limit. The profiler reports that I use 26 registers per thread, which is high. Can anyone give me a hint on how to reduce the number of registers per thread?

The kernel is:

[code]

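// The kernel uses double, so enable the fp64 extension where it is required.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void relCalculation(const __global int* a,
                             const __global double* fProb,
                             const __global float* T,
                             __global int* Output,
                             const int max,
                             const int idx,
                             const int col,
                             const double rel,
                             const double t)
{
    const int i = get_global_id(0);

    // max is treated as the last valid index (inclusive).
    if (i <= max)
    {
        // Product over the col entries of row i.
        double GP = 1;
        for (int j = 0; j < col; ++j)
        {
            GP *= pow((1 - pow(fProb[j + idx], a[i*col + j])), t/T[j]);
        }

        // Flag the row if the product reaches the threshold rel.
        if (GP >= rel)
            Output[i] = 1;
        else
            Output[i] = 0;
    }
}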



[/code]
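The register limit described above can be worked through as rough arithmetic. This is only a sketch: the post does not name the GPU, so the register file size (32768 registers per multiprocessor) and the work-group size (256) used in the note below are assumptions, and the helper names `max_resident_threads` and `register_budget` are hypothetical. Real devices also allocate registers in device-specific chunks, which this estimate ignores.

```c
#include <assert.h>

/* Rough estimate of how many threads can be resident on one multiprocessor
   when registers are the limiting factor. Every hardware figure is supplied
   by the caller and is an assumption, not a queried device property. */
static int max_resident_threads(int regs_per_thread,
                                int regs_per_multiprocessor,
                                int hw_thread_limit,
                                int work_group_size)
{
    assert(regs_per_thread > 0 && work_group_size > 0);

    /* Threads that fit into the register file. */
    int by_registers = regs_per_multiprocessor / regs_per_thread;

    /* Never more than the hardware thread limit. */
    int threads = by_registers < hw_thread_limit ? by_registers
                                                 : hw_thread_limit;

    /* Only whole work-groups can be scheduled. */
    return (threads / work_group_size) * work_group_size;
}

/* Largest per-thread register count that still fits target_threads
   into the register file. */
static int register_budget(int regs_per_multiprocessor, int target_threads)
{
    assert(target_threads > 0);
    return regs_per_multiprocessor / target_threads;
}
```

With the figures from the post (26 registers per thread, a limit of 1563 threads) and the assumed 32768 registers and work-group size of 256, `max_resident_threads(26, 32768, 1563, 256)` gives 1024, which matches the profiler. Under the same assumptions, `register_budget(32768, 1563)` gives 20, so roughly 20 registers per thread would be the target.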
#1
Posted 03/20/2012 07:29 PM   

You could try putting your const variables into constant memory instead of passing them as kernel parameters, e.g.

__constant double t = 12345;


Constant memory is slower than private memory, but the kernel may still perform better overall because more threads can run in parallel.

--
Srdja
#2
Posted 03/21/2012 05:02 PM   