shared memory vs registers
Hi everybody. I have the following problem: my kernel uses a lot of registers, so when I profiled my program I found low multiprocessor occupancy caused by the high register usage per thread. As an experiment I tried using shared memory instead of registers. After that the profiler showed higher occupancy, but the program runs slower and produces wrong results. I don't know whether it is possible to use shared memory for local variables, or whether there is another way to increase the occupancy?
Thanks

#1
Posted 07/26/2017 02:30 PM   
[is there] another way to increase the occupancy

Maybe, but it is impossible to tell whether that is so or not without looking at the code in question. One thing to look for is the unnecessary use of expensive math functions, such as pow(). If you use an old GPU, switching to one with a newer architecture that provides more registers may be advisable.
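To make the pow() remark concrete, here is a generic sketch (not taken from the poster's code; the helper names are illustrative):

// Sketch: for a small integer exponent, explicit multiplication avoids the
// generic (and expensive) pow() routine.
__device__ double cube_slow(double x) { return pow(x, 3.0); }  // generic pow(): many instructions
__device__ double cube_fast(double x) { return x * x * x; }    // two multiplies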

It is possible to use shared memory for thread-local variables, but you will not be able to store much thread-local data this way, and it will require a pointer to the storage which eats at least one register. Also, access to shared memory isn't as fast as a register access and if occupancy is low (you did not indicate how low), there may not be enough independent work to cover the additional latency.
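For illustration, a minimal sketch of what that looks like (dynamic shared memory, with made-up names; the per-thread pointer is what costs at least one register):

__global__ void kernel_with_shared_locals(double *out)
{
    extern __shared__ double smem[];              // sized at launch: threadsPerBlock * 3 * sizeof(double)
    double *my_vals = &smem[threadIdx.x * 3];     // per-thread slice; this pointer lives in a register
    my_vals[0] = 1.0;                             // "thread-local" data now resides in shared memory
    my_vals[1] = 2.0;
    my_vals[2] = my_vals[0] + my_vals[1];
    out[blockIdx.x * blockDim.x + threadIdx.x] = my_vals[2];
}
// launch: kernel_with_shared_locals<<<numBlocks, threadsPerBlock, threadsPerBlock * 3 * sizeof(double)>>>(d_out);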

Note that occupancy is only loosely correlated with performance.

#2
Posted 07/26/2017 02:42 PM   
The code is:
__global__ void Everhart_device_centr_body(int *id_ko_array, cu_real *a, cu_real *b, cu_real t_tek, const int offset, cu_real Step)
{
int index = threadIdx.x + blockIdx.x*blockDim.x;
if (index < offset)
{
// cu_real w_gpz[3] = { 0,0,0 };
// cu_real r0[3] = { 0,0,0 };
// cu_real v0[3] = { 0,0,0 };
// cu_real r_i[3] = { 0,0,0 };
// cu_real v_i[3] = { 0,0,0 };
// cu_real f0[3] = { 0,0,0 };
// cu_real f_i[3] = { 0,0,0 };
// cu_real A1[3] = { 0,0,0 };
// cu_real A2[3] = { 0,0,0 };
// cu_real A3[3] = { 0,0,0 };
// cu_real A4[3] = { 0,0,0 };
// cu_real A5[3] = { 0,0,0 };
// cu_real A6[3] = { 0,0,0 };
// cu_real A7[3] = { 0,0,0 };
// cu_real alpha1[3] = { 0,0,0 };
// cu_real alpha2[3] = { 0,0,0 };
// cu_real alpha3[3] = { 0,0,0 };
// cu_real alpha4[3] = { 0,0,0 };
// cu_real alpha5[3] = { 0,0,0 };
// cu_real alpha6[3] = { 0,0,0 };
// cu_real alpha7[3] = { 0,0,0 };
// cu_real c_forward[7][7];
// cu_real t_ever[7] = {0.056262560526922146* Step, 0.180240691736892364* Step, 0.352624717113169* Step, 0.547153626330555383* Step, 0.734210177215410531* Step, 0.885320946839095768* Step, 0.977520613561287501* Step};

__shared__ cu_real w_gpz[3];
__shared__ cu_real r0[3];
__shared__ cu_real v0[3];
__shared__ cu_real r_i[3];
__shared__ cu_real v_i[3];
__shared__ cu_real f_i[3];
__shared__ cu_real f0[3];
__shared__ cu_real A1[3];
__shared__ cu_real A2[3];
__shared__ cu_real A3[3];
__shared__ cu_real A4[3];
__shared__ cu_real A5[3];
__shared__ cu_real A6[3];
__shared__ cu_real A7[3];
__shared__ cu_real alpha1[3];
__shared__ cu_real alpha2[3];
__shared__ cu_real alpha3[3];
__shared__ cu_real alpha4[3];
__shared__ cu_real alpha5[3];
__shared__ cu_real alpha6[3];
__shared__ cu_real alpha7[3];
__shared__ cu_real t_ever[7];
__shared__ cu_real c_forward[7][7];
t_ever[0] = 0.056262560526922146* Step;
t_ever[1] = 0.180240691736892364* Step;
t_ever[2] = 0.352624717113169* Step;
t_ever[3] = 0.547153626330555383* Step;
t_ever[4] = 0.734210177215410531* Step;
t_ever[5] = 0.885320946839095768* Step;
t_ever[6] = 0.977520613561287501* Step;
for (int i =0; i < 3; i++)
{
A1[i] = 0;
A2[i] = 0;
A3[i] = 0;
A4[i] = 0;
A5[i] = 0;
A6[i] = 0;
A7[i] = 0;
alpha1[i] = 0;
alpha2[i] = 0;
alpha3[i] = 0;
alpha4[i] = 0;
alpha5[i] = 0;
alpha6[i] = 0;
alpha7[i] = 0;
r0[i] = 0;
v0[i] = 0;
w_gpz[i] = 0;
r_i[i] = 0;
v_i[i] = 0;
f_i[i] = 0;
f0[i] = 0;
}

for (int i = 0; i < 7; i++)
{
for (int j = 0; j <= i; j++)
{
if (i == j)
{
c_forward[i][j] = 1.0;
}
else
{
if (j == 0)
{
c_forward[i][j] = -t_ever[i - 1] * c_forward[i - 1][0];
}
else
{
c_forward[i][j] = c_forward[i - 1][j - 1] - t_ever[i - 1] * c_forward[i - 1][j];

}
}
}
}
r0[0] = a[offset * 0 + index];
r0[1] = a[offset * 1 + index];
r0[2] = a[offset * 2 + index];
v0[0] = a[offset * 3 + index];
v0[1] = a[offset * 4 + index];
v0[2] = a[offset * 5 + index];
centr_body(r0, w_gpz);
f0[0] = w_gpz[0] + Om2*v0[1] + OmEk*r0[0];
f0[1] = w_gpz[1] - Om2*v0[0] + OmEk*r0[1];
f0[2] = w_gpz[2];
for (int n = 0; n < 7; n++)
{
for (int i = 0; i < 7; i++)
{
for (int j = 0; j < 3; j++)
{
r_i[j] = r0[j] + v0[j] * t_ever[i] + 0.5*f0[j] * t_ever[i] * t_ever[i] +
(A1[j] * t_ever[i] * t_ever[i] * t_ever[i]) / 6 +
(A2[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 12 +
(A3[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 20 +
(A4[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 30.0 +
(A5[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 42.0 +
(A6[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 56.0 +
(A7[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 72.0;
v_i[j] = v0[j] + f0[j] * t_ever[i] + A1[j] * t_ever[i] * t_ever[i] / 2 +
A2[j] * t_ever[i] * t_ever[i] * t_ever[i] / 3 +
A3[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 4 +
A4[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 5.0 +
A5[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 6.0 +
A6[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 7.0 +
A7[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 8.0;
}
centr_body(r_i, w_gpz);
for (int k = 0; k < 3; k++)
{

if (k == 0) { f_i[k] = w_gpz[k] + Om2*v_i[1] + OmEk*r_i[0]; }
if (k == 1) { f_i[k] = w_gpz[k] - Om2*v_i[0] + OmEk*r_i[1]; }
if (k == 2) { f_i[k] = w_gpz[k]; }
if (i == 0) { alpha1[k] = (f_i[k] - f0[k]) / t_ever[0]; }
if (i == 1) { alpha2[k] = ((f_i[k] - f0[k]) / t_ever[1] - alpha1[k]) / (t_ever[1] - t_ever[0]); }
if (i == 2) { alpha3[k] = (((f_i[k] - f0[k]) / t_ever[2] - alpha1[k]) / (t_ever[2] - t_ever[0]) - alpha2[k]) / (t_ever[2] - t_ever[1]); }
if (i == 3) { alpha4[k] = ((((f_i[k] - f0[k]) / t_ever[3] - alpha1[k]) / (t_ever[3] - t_ever[0]) - alpha2[k]) / (t_ever[3] - t_ever[1]) - alpha3[k]) / (t_ever[3] - t_ever[2]); }
if (i == 4) { alpha5[k] = (((((f_i[k] - f0[k]) / t_ever[4] - alpha1[k]) / (t_ever[4] - t_ever[0]) - alpha2[k]) / (t_ever[4] - t_ever[1]) - alpha3[k]) / (t_ever[4] - t_ever[2]) - alpha4[k]) / (t_ever[4] - t_ever[3]); }
if (i == 5) { alpha6[k] = ((((((f_i[k] - f0[k]) / t_ever[5] - alpha1[k]) / (t_ever[5] - t_ever[0]) - alpha2[k]) / (t_ever[5] - t_ever[1]) - alpha3[k]) / (t_ever[5] - t_ever[2]) - alpha4[k]) / (t_ever[5] - t_ever[3]) - alpha5[k]) / (t_ever[5] - t_ever[4]); }
if (i == 6) { alpha7[k] = (((((((f_i[k] - f0[k]) / t_ever[6] - alpha1[k]) / (t_ever[6] - t_ever[0]) - alpha2[k]) / (t_ever[6] - t_ever[1]) - alpha3[k]) / (t_ever[6] - t_ever[2]) - alpha4[k]) / (t_ever[6] - t_ever[3]) - alpha5[k]) / (t_ever[6] - t_ever[4]) - alpha6[k]) / (t_ever[6] - t_ever[5]); }
A1[k] = alpha1[k] + c_forward[1][0] * alpha2[k] + c_forward[2][0] * alpha3[k] + c_forward[3][0] * alpha4[k] + c_forward[4][0] * alpha5[k] + c_forward[5][0] * alpha6[k] + c_forward[6][0] * alpha7[k];
A2[k] = alpha2[k] + c_forward[2][1] * alpha3[k] + c_forward[3][1] * alpha4[k] + c_forward[4][1] * alpha5[k] + c_forward[5][1] * alpha6[k] + c_forward[6][1] * alpha7[k];
A3[k] = alpha3[k] + c_forward[3][2] * alpha4[k] + c_forward[4][2] * alpha5[k] + c_forward[5][2] * alpha6[k] + c_forward[6][2] * alpha7[k];
A4[k] = alpha4[k] + c_forward[4][3] * alpha5[k] + c_forward[5][3] * alpha6[k] + c_forward[6][3] * alpha7[k];
A5[k] = alpha5[k] + c_forward[5][4] * alpha6[k] + c_forward[6][4] * alpha7[k];
A6[k] = alpha6[k] + c_forward[6][5] * alpha7[k];
A7[k] = alpha7[k];
}
}
}
b[offset * 0 + index] = r0[0] + v0[0] * Step + 0.5*f0[0] * Step*Step + (A1[0] * Step*Step*Step) / 6 + (A2[0] * Step*Step*Step*Step) / 12 + (A3[0] * Step*Step*Step*Step*Step) / 20 + (A4[0] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[0] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[0] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[0] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
b[offset * 1 + index] = r0[1] + v0[1] * Step + 0.5*f0[1] * Step*Step + (A1[1] * Step*Step*Step) / 6 + (A2[1] * Step*Step*Step*Step) / 12 + (A3[1] * Step*Step*Step*Step*Step) / 20 + (A4[1] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[1] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[1] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[1] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
b[offset * 2 + index] = r0[2] + v0[2] * Step + 0.5*f0[2] * Step*Step + (A1[2] * Step*Step*Step) / 6 + (A2[2] * Step*Step*Step*Step) / 12 + (A3[2] * Step*Step*Step*Step*Step) / 20 + (A4[2] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[2] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[2] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[2] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
b[offset * 3 + index] = v0[0] + f0[0] * Step + A1[0] * Step*Step / 2 + A2[0] * Step*Step*Step / 3 + A3[0] * Step*Step*Step*Step / 4 + A4[0] * Step*Step*Step*Step*Step / 5.0 + A5[0] * Step*Step*Step*Step*Step*Step / 6.0 + A6[0] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[0] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
b[offset * 4 + index] = v0[1] + f0[1] * Step + A1[1] * Step*Step / 2 + A2[1] * Step*Step*Step / 3 + A3[1] * Step*Step*Step*Step / 4 + A4[1] * Step*Step*Step*Step*Step / 5.0 + A5[1] * Step*Step*Step*Step*Step*Step / 6.0 + A6[1] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[1] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
b[offset * 5 + index] = v0[2] + f0[2] * Step + A1[2] * Step*Step / 2 + A2[2] * Step*Step*Step / 3 + A3[2] * Step*Step*Step*Step / 4 + A4[2] * Step*Step*Step*Step*Step / 5.0 + A5[2] * Step*Step*Step*Step*Step*Step / 6.0 + A6[2] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[2] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
}
}

This is numerical integration with the Everhart method, where centr_body is a device function. I use a GTX 960 with 3 GB.

#3
Posted 07/26/2017 03:01 PM   
For good measure, throw in some #pragma unroll statements into loops that have a known (and low) trip count. Does it improve things? It will get rid of a lot of the required indexing computations.

#pragma unroll 3
for (int i =0; i < 3; i++)


A few more things. Putting those variables into __shared__ like you did will break as soon as there is more than 1 thread in a thread block, unless you also introduce a *thread specific* indexing into each shared memory variable. It is highly likely you will run out of shared memory before you can reach a reasonable number of threads in a block with this method.


Defining your arrays with local scope (like in the commented out version) and indexing them in a for loop like this...

cu_real A1[3];
for (int i =0; i < 3; i++)
{
// access A1[i]


...forces all these arrays to reside in local memory. And that is going to be painfully slow. A #pragma unroll statement in front of such loops defining indices will allow the arrays to live in registers instead (the index variable will be known at compilation time).
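A minimal sketch of that difference (the array name is illustrative):

cu_real acc[3];
#pragma unroll
for (int i = 0; i < 3; i++)
{
    acc[i] = 0;   // after full unrolling every index is a compile-time constant,
}                 // so acc[] can be kept entirely in registers instead of local memory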

My recommendation is: forget about the __shared__ memory and unroll all your indexing loops.

#4
Posted 07/26/2017 03:45 PM   
What type is 'cu_real'? If it is 'float', try compiling with -prec-div=false and make sure all floating-point literal constants have a trailing 'f' suffix, otherwise your computation will be promoted to 'double', making it slow on this consumer GPU.
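To illustrate the literal-suffix point (a sketch assuming cu_real is 'float' here; the constant and Step are taken from the posted kernel):

float t0_slow = 0.056262560526922146  * Step;   // double literal: the multiply is done in double precision
float t0_fast = 0.056262560526922146f * Step;   // 'f' suffix keeps the computation in single precision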

Is this some sort of machine-generated code? I would suggest arranging the polynomial evaluations according to the Horner scheme. I am not familiar with this (complex) Everhart method of integration; are you sure it is superior to simpler integration algorithms?

#5
Posted 07/26/2017 03:57 PM   
Does this describe the integration method used?

Everhart, Edgar. "Implicit single-sequence methods for integrating orbits." Celestial Mechanics and Dynamical Astronomy 10.1 (1974): 35-55.

A #pragma unroll statement in front of such loops defining indices will allow the arrays to live in registers instead.

Good general advice, but this would tend to increase register pressure which OP claims is a problem already.

I think (haven't checked) that writing the polynomials in the form of serial multiplies creates many temporary variables through CSE, which is why trying nested multiplication (i.e. Horner scheme) should be advantageous and is likely to also improve accuracy. E.g.:

(A7[j]/72.0 * t_ever[i] + A6[j]/56.0) * t_ever[i] + ...

#6
Posted 07/26/2017 04:16 PM   
I doubt OP's problem is register pressure alone. There appears to be an excessive amount of local memory access in the code - and the attempted fix using shared memory is not going to work as is.

Seeing screenshots of the profiling results would be helpful.

#7
Posted 07/26/2017 04:29 PM   
He mentioned the profiler, that's all we know at this point. We don't know what the register usage actually is or what compiler options were used to build the code. The compiler's unroller may or may not have unrolled the loops in question automatically.

My reading on Everhart integration suggests use of double-precision arithmetic (i.e. cu_real = double), since the algorithm seems to be targeted at cases where accuracy is already a problem. In that case, the use of DP on a consumer GPU would be the biggest issue.

The code as posted doesn't compile because of this (centr_body, Om2, and OmEk are not defined in the post):

centr_body(r0, w_gpz);
f0[0] = w_gpz[0] + Om2*v0[1] + OmEk*r0[0];
f0[1] = w_gpz[1] - Om2*v0[0] + OmEk*r0[1];


I hacked up the code so it compiles. Using CUDA 8.0, with cu_real = float, compiled for sm_50, I see:

(1) Code as-is: 100 registers
(2) as (1), but with floating-point literals as 'float' constants (with 'f' suffix): 92 registers
(3) as (2), plus __restrict__ on kernel pointer arguments: 87 registers
(4) as (3), but with -prec-div=false: 80 registers
(5) as (3), plus 'unroll 3' on innermost loops: 120 registers
(6) as (3), plus unrolling of all fixed-iteration loops: 104 registers

The actual register usage is likely a bit higher due to omitted code. If cu_real = double, it would be higher still.

#8
Posted 07/26/2017 04:37 PM   
njuffa is right. I ran this kernel with cu_real = double and the profiler showed about 150 registers per thread; I don't remember the exact number. centr_body is a device function, but you need to do some calculations for w_gpz[0] = -mu_e*r[0]/(modR*modR*modR). I'm sorry, but I can show the results of the profiling only tomorrow.

#9
Posted 07/26/2017 05:41 PM   
With cu_real = double I would assume the code is limited by the throughput of the double-precision units (as I recall, 1/32 the throughput of the single-precision units on the GPU used), rather than occupancy.

Use nested multiplication for all polynomials to reduce the number of double-precision operations needed. With nested multiplication (Horner scheme), FMAs will be used instead of separate multiplies and adds. Also consider explicitly converting the divisions in these into multiplications with the reciprocals, as this will save a lot of double-precision operations. To keep the computation flexible with regard to type, I would suggest pulling out the weights into separate 'const' variables. E.g.:

cu_real tmp = t_ever[i];
const cu_real aw0 = 1.0 / 2.0;
const cu_real aw1 = 1.0 / 6.0;
const cu_real aw2 = 1.0 / 12.0;
const cu_real aw3 = 1.0 / 20.0;
const cu_real aw4 = 1.0 / 30.0;
const cu_real aw5 = 1.0 / 42.0;
const cu_real aw6 = 1.0 / 56.0;
const cu_real aw7 = 1.0 / 72.0;
r_i[j] = (((((((((A7[j] * aw7) * tmp + A6[j] * aw6) * tmp + A5[j] * aw5) * tmp +
A4[j] * aw4) * tmp + A3[j] * aw3) * tmp + A2[j] * aw2) * tmp +
A1[j] * aw1) * tmp + aw0 * f0[j]) * tmp + v0[j]) * tmp + r0[j];


By all means, try adding 'const' and '__restrict__' attributes to the pointer arguments to the kernel as appropriate. This will give the compiler greater flexibility in scheduling loads. For cu_real = double, I see register usage at 160 for the code as-is, and 152 when __restrict__ pointers are used. When also changing the polynomials in the way shown above the register usage reduces to 146.
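As a sketch, the posted kernel's signature with 'const' and '__restrict__' added (assuming 'a' is only read and 'b' only written, as in the code shown above):

__global__ void Everhart_device_centr_body(const int     * __restrict__ id_ko_array,
                                           const cu_real * __restrict__ a,
                                           cu_real       * __restrict__ b,
                                           cu_real t_tek, const int offset, cu_real Step)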

#10
Posted 07/26/2017 05:58 PM   
Thanks everybody for your comments, I'll try to rewrite my code according to your advice. But there is one thing I don't understand: why do I get wrong results when I replace registers with shared memory? If I understand correctly, there is no interaction between threads in my code, so I have no need for synchronization. What's wrong with shared memory? To be more precise, I get wrong results only when I use shared memory for the variables f_i or r_i. The other variables work correctly with shared memory.
p.s. The problem I am trying to solve requires double precision because of the needed accuracy. I wrote the code when I used a Tesla K40 for the calculations. Now I don't have that powerful device and try to use just what I have - a GTX 960.
The profiler shows 173 registers without shared memory and 114 with it. Occupancy with registers is 12.3 and occupancy with shared memory is 25. Occupancy is closely connected with the number of registers per thread. Interestingly, the computation with higher occupancy (with shared memory) is slower.

#11
Posted 07/27/2017 06:22 AM   
If each thread's access to shared memory is exclusive to that thread, it should work without synchronization. So you would index shmem_f_i[index*3 + i] instead of the thread-local f_i[i]. Note that depending on your block size, shared memory could be exhausted fairly quickly. f_i[] comprises three 'double' values of 8 bytes each, so 24 bytes per thread.
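A minimal sketch of that layout (array name and block size are illustrative; the slice is selected with the in-block index threadIdx.x so each thread stays within the block's shared memory and touches only its own elements):

#define BLOCK_SIZE 128                             // illustrative; must match the launch configuration
__shared__ cu_real smem_f_i[BLOCK_SIZE * 3];       // 3 elements per thread: 24 bytes/thread for double

cu_real *f_i = &smem_f_i[threadIdx.x * 3];         // this thread's private slice
f_i[0] = 0; f_i[1] = 0; f_i[2] = 0;                // no __syncthreads() needed: no sharing between threads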

With regard to register pressure, I am not sure this is the biggest performance bottleneck. I still think you would want to minimize the number of double-precision operations first. Are you using CUDA 8? Different compiler versions may be tuned differently with regard to aggressive scheduling of loads, for example, and this has an impact on register pressure.

Occupancy is only loosely coupled with performance. See Vasily Volkov's "Better Performance at Lower Occupancy": http://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf. Shared memory access is slower than register access.

#12
Posted 07/27/2017 06:50 AM   
I'm using CUDA 8. I have thought about reducing the number of double-precision operations, especially where there is no need for them. I need to read about Horner schemes to better understand how to use them in my code. Maybe there is another way to parallelize the propagation of many space objects using the Everhart method, though I already get good speed compared with the CPU. I'm just confused about the occupancy (12.5 seems too low), but you said that occupancy is only loosely correlated with performance.

#13
Posted 07/27/2017 07:09 AM   
In the context of polynomial evaluation, Horner's scheme is simply a fancy name for nested multiplication, as shown in my example in #10. The Wikipedia article is already overly complicated for what is needed here. Horner's scheme typically maximizes accuracy and maps well to FMA (fused multiply-add) operations, which are the workhorse of GPU floating-point computation. If more instruction-level parallelism were needed, one could use a second-order Horner scheme, but this would increase register pressure, as would other evaluation schemes such as Estrin's.
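For completeness, a sketch of second-order Horner for a low-degree polynomial (coefficient names c0..c4 and the variable t are made up); the even and odd chains are independent, which is where both the extra instruction-level parallelism and the extra temporaries come from:

// p(t) = c0 + c1*t + c2*t^2 + c3*t^3 + c4*t^4
cu_real t2   = t * t;
cu_real even = (c4 * t2 + c2) * t2 + c0;   // two FMAs
cu_real odd  = (c3 * t2 + c1) * t;         // one FMA plus one multiply
cu_real p    = even + odd;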

#14
Posted 07/27/2017 07:19 AM   
Thanks for the Horner schemes. They increased the speed of the calculation by about a factor of 2. I'm really glad for this help.

#15
Posted 07/27/2017 08:03 AM   