

Maybe, but it is impossible to tell whether that is so or not without looking at the code in question. One thing to look for is the unnecessary use of expensive math functions, such as pow(). If you use an old GPU, switching to one with a newer architecture that provides more registers may be advisable.
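For instance (a trivial illustration; x here is just a placeholder variable):

    cu_real y_slow = pow(x, 2.0);   // general-purpose pow(): very expensive, especially in double precision
    cu_real y_fast = x * x;         // same result for a small integer exponent; a single multiply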

It is possible to use shared memory for thread-local variables, but you will not be able to store much thread-local data this way, and it will require a pointer to the storage which eats at least one register. Also, access to shared memory isn't as fast as a register access, and if occupancy is low (you did not indicate how low), there may not be enough independent work to cover the additional latency. Note that occupancy is only loosely correlated with performance.

__global__ void Everhart_device_centr_body(int *id_ko_array, cu_real *a, cu_real *b, cu_real t_tek, const int offset, cu_real Step)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    if (index < offset)
    {
        // Thread-local versions, commented out in favor of the __shared__ variables below:
        // cu_real w_gpz[3] = { 0,0,0 };
        // cu_real r0[3] = { 0,0,0 };
        // cu_real v0[3] = { 0,0,0 };
        // cu_real r_i[3] = { 0,0,0 };
        // cu_real v_i[3] = { 0,0,0 };
        // cu_real f0[3] = { 0,0,0 };
        // cu_real f_i[3] = { 0,0,0 };
        // cu_real A1[3] = { 0,0,0 };
        // cu_real A2[3] = { 0,0,0 };
        // cu_real A3[3] = { 0,0,0 };
        // cu_real A4[3] = { 0,0,0 };
        // cu_real A5[3] = { 0,0,0 };
        // cu_real A6[3] = { 0,0,0 };
        // cu_real A7[3] = { 0,0,0 };
        // cu_real alpha1[3] = { 0,0,0 };
        // cu_real alpha2[3] = { 0,0,0 };
        // cu_real alpha3[3] = { 0,0,0 };
        // cu_real alpha4[3] = { 0,0,0 };
        // cu_real alpha5[3] = { 0,0,0 };
        // cu_real alpha6[3] = { 0,0,0 };
        // cu_real alpha7[3] = { 0,0,0 };
        // cu_real c_forward[7][7];
        // cu_real t_ever[7] = { 0.056262560526922146*Step, 0.180240691736892364*Step, 0.352624717113169*Step, 0.547153626330555383*Step, 0.734210177215410531*Step, 0.885320946839095768*Step, 0.977520613561287501*Step };

        __shared__ cu_real w_gpz[3];
        __shared__ cu_real r0[3];
        __shared__ cu_real v0[3];
        __shared__ cu_real r_i[3];
        __shared__ cu_real v_i[3];
        __shared__ cu_real f_i[3];
        __shared__ cu_real f0[3];
        __shared__ cu_real A1[3];
        __shared__ cu_real A2[3];
        __shared__ cu_real A3[3];
        __shared__ cu_real A4[3];
        __shared__ cu_real A5[3];
        __shared__ cu_real A6[3];
        __shared__ cu_real A7[3];
        __shared__ cu_real alpha1[3];
        __shared__ cu_real alpha2[3];
        __shared__ cu_real alpha3[3];
        __shared__ cu_real alpha4[3];
        __shared__ cu_real alpha5[3];
        __shared__ cu_real alpha6[3];
        __shared__ cu_real alpha7[3];
        __shared__ cu_real t_ever[7];
        __shared__ cu_real c_forward[7][7];

        // Substep times within the integration step (Everhart node spacings scaled by Step)
        t_ever[0] = 0.056262560526922146*Step;
        t_ever[1] = 0.180240691736892364*Step;
        t_ever[2] = 0.352624717113169*Step;
        t_ever[3] = 0.547153626330555383*Step;
        t_ever[4] = 0.734210177215410531*Step;
        t_ever[5] = 0.885320946839095768*Step;
        t_ever[6] = 0.977520613561287501*Step;

        for (int i = 0; i < 3; i++)
        {
            A1[i] = 0;
            A2[i] = 0;
            A3[i] = 0;
            A4[i] = 0;
            A5[i] = 0;
            A6[i] = 0;
            A7[i] = 0;
            alpha1[i] = 0;
            alpha2[i] = 0;
            alpha3[i] = 0;
            alpha4[i] = 0;
            alpha5[i] = 0;
            alpha6[i] = 0;
            alpha7[i] = 0;
            r0[i] = 0;
            v0[i] = 0;
            w_gpz[i] = 0;
            r_i[i] = 0;
            v_i[i] = 0;
            f_i[i] = 0;
            f0[i] = 0;
        }

        // Coefficients that convert the divided differences (alpha) into
        // the polynomial coefficients (A)
        for (int i = 0; i < 7; i++)
        {
            for (int j = 0; j <= i; j++)
            {
                if (i == j)
                {
                    c_forward[i][j] = 1.0;
                }
                else
                {
                    if (j == 0)
                    {
                        c_forward[i][j] = -t_ever[i - 1] * c_forward[i - 1][0];
                    }
                    else
                    {
                        c_forward[i][j] = c_forward[i - 1][j - 1] - t_ever[i - 1] * c_forward[i - 1][j];
                    }
                }
            }
        }

        // Initial position and velocity (structure-of-arrays layout with stride 'offset')
        r0[0] = a[offset * 0 + index];
        r0[1] = a[offset * 1 + index];
        r0[2] = a[offset * 2 + index];
        v0[0] = a[offset * 3 + index];
        v0[1] = a[offset * 4 + index];
        v0[2] = a[offset * 5 + index];

        // Initial acceleration: central-body term plus rotation-dependent terms
        // (Om2 and OmEk are constants defined elsewhere)
        centr_body(r0, w_gpz);
        f0[0] = w_gpz[0] + Om2*v0[1] + OmEk*r0[0];
        f0[1] = w_gpz[1] - Om2*v0[0] + OmEk*r0[1];
        f0[2] = w_gpz[2];

        // Iterative refinement: 7 sweeps over the 7 substeps
        for (int n = 0; n < 7; n++)
        {
            for (int i = 0; i < 7; i++)
            {
                // Predict position and velocity at substep i from the current A coefficients
                for (int j = 0; j < 3; j++)
                {
                    r_i[j] = r0[j] + v0[j] * t_ever[i] + 0.5*f0[j] * t_ever[i] * t_ever[i] +
                        (A1[j] * t_ever[i] * t_ever[i] * t_ever[i]) / 6 +
                        (A2[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 12 +
                        (A3[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 20 +
                        (A4[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 30.0 +
                        (A5[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 42.0 +
                        (A6[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 56.0 +
                        (A7[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i]) / 72.0;
                    v_i[j] = v0[j] + f0[j] * t_ever[i] + A1[j] * t_ever[i] * t_ever[i] / 2 +
                        A2[j] * t_ever[i] * t_ever[i] * t_ever[i] / 3 +
                        A3[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 4 +
                        A4[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 5.0 +
                        A5[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 6.0 +
                        A6[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 7.0 +
                        A7[j] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] * t_ever[i] / 8.0;
                }
                // Acceleration at the predicted point, then update of the divided
                // differences of the force and the polynomial coefficients
                centr_body(r_i, w_gpz);
                for (int k = 0; k < 3; k++)
                {
                    if (k == 0) { f_i[k] = w_gpz[k] + Om2*v_i[1] + OmEk*r_i[0]; }
                    if (k == 1) { f_i[k] = w_gpz[k] - Om2*v_i[0] + OmEk*r_i[1]; }
                    if (k == 2) { f_i[k] = w_gpz[k]; }
                    if (i == 0) { alpha1[k] = (f_i[k] - f0[k]) / t_ever[0]; }
                    if (i == 1) { alpha2[k] = ((f_i[k] - f0[k]) / t_ever[1] - alpha1[k]) / (t_ever[1] - t_ever[0]); }
                    if (i == 2) { alpha3[k] = (((f_i[k] - f0[k]) / t_ever[2] - alpha1[k]) / (t_ever[2] - t_ever[0]) - alpha2[k]) / (t_ever[2] - t_ever[1]); }
                    if (i == 3) { alpha4[k] = ((((f_i[k] - f0[k]) / t_ever[3] - alpha1[k]) / (t_ever[3] - t_ever[0]) - alpha2[k]) / (t_ever[3] - t_ever[1]) - alpha3[k]) / (t_ever[3] - t_ever[2]); }
                    if (i == 4) { alpha5[k] = (((((f_i[k] - f0[k]) / t_ever[4] - alpha1[k]) / (t_ever[4] - t_ever[0]) - alpha2[k]) / (t_ever[4] - t_ever[1]) - alpha3[k]) / (t_ever[4] - t_ever[2]) - alpha4[k]) / (t_ever[4] - t_ever[3]); }
                    if (i == 5) { alpha6[k] = ((((((f_i[k] - f0[k]) / t_ever[5] - alpha1[k]) / (t_ever[5] - t_ever[0]) - alpha2[k]) / (t_ever[5] - t_ever[1]) - alpha3[k]) / (t_ever[5] - t_ever[2]) - alpha4[k]) / (t_ever[5] - t_ever[3]) - alpha5[k]) / (t_ever[5] - t_ever[4]); }
                    if (i == 6) { alpha7[k] = (((((((f_i[k] - f0[k]) / t_ever[6] - alpha1[k]) / (t_ever[6] - t_ever[0]) - alpha2[k]) / (t_ever[6] - t_ever[1]) - alpha3[k]) / (t_ever[6] - t_ever[2]) - alpha4[k]) / (t_ever[6] - t_ever[3]) - alpha5[k]) / (t_ever[6] - t_ever[4]) - alpha6[k]) / (t_ever[6] - t_ever[5]); }
                    A1[k] = alpha1[k] + c_forward[1][0] * alpha2[k] + c_forward[2][0] * alpha3[k] + c_forward[3][0] * alpha4[k] + c_forward[4][0] * alpha5[k] + c_forward[5][0] * alpha6[k] + c_forward[6][0] * alpha7[k];
                    A2[k] = alpha2[k] + c_forward[2][1] * alpha3[k] + c_forward[3][1] * alpha4[k] + c_forward[4][1] * alpha5[k] + c_forward[5][1] * alpha6[k] + c_forward[6][1] * alpha7[k];
                    A3[k] = alpha3[k] + c_forward[3][2] * alpha4[k] + c_forward[4][2] * alpha5[k] + c_forward[5][2] * alpha6[k] + c_forward[6][2] * alpha7[k];
                    A4[k] = alpha4[k] + c_forward[4][3] * alpha5[k] + c_forward[5][3] * alpha6[k] + c_forward[6][3] * alpha7[k];
                    A5[k] = alpha5[k] + c_forward[5][4] * alpha6[k] + c_forward[6][4] * alpha7[k];
                    A6[k] = alpha6[k] + c_forward[6][5] * alpha7[k];
                    A7[k] = alpha7[k];
                }
            }
        }

        // Advance the state by the full Step and write the result
        b[offset * 0 + index] = r0[0] + v0[0] * Step + 0.5*f0[0] * Step*Step + (A1[0] * Step*Step*Step) / 6 + (A2[0] * Step*Step*Step*Step) / 12 + (A3[0] * Step*Step*Step*Step*Step) / 20 + (A4[0] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[0] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[0] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[0] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
        b[offset * 1 + index] = r0[1] + v0[1] * Step + 0.5*f0[1] * Step*Step + (A1[1] * Step*Step*Step) / 6 + (A2[1] * Step*Step*Step*Step) / 12 + (A3[1] * Step*Step*Step*Step*Step) / 20 + (A4[1] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[1] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[1] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[1] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
        b[offset * 2 + index] = r0[2] + v0[2] * Step + 0.5*f0[2] * Step*Step + (A1[2] * Step*Step*Step) / 6 + (A2[2] * Step*Step*Step*Step) / 12 + (A3[2] * Step*Step*Step*Step*Step) / 20 + (A4[2] * Step*Step*Step*Step*Step*Step) / 30.0 + (A5[2] * Step*Step*Step*Step*Step*Step*Step) / 42.0 + (A6[2] * Step*Step*Step*Step*Step*Step*Step*Step) / 56.0 + (A7[2] * Step*Step*Step*Step*Step*Step*Step*Step*Step) / 72.0;
        b[offset * 3 + index] = v0[0] + f0[0] * Step + A1[0] * Step*Step / 2 + A2[0] * Step*Step*Step / 3 + A3[0] * Step*Step*Step*Step / 4 + A4[0] * Step*Step*Step*Step*Step / 5.0 + A5[0] * Step*Step*Step*Step*Step*Step / 6.0 + A6[0] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[0] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
        b[offset * 4 + index] = v0[1] + f0[1] * Step + A1[1] * Step*Step / 2 + A2[1] * Step*Step*Step / 3 + A3[1] * Step*Step*Step*Step / 4 + A4[1] * Step*Step*Step*Step*Step / 5.0 + A5[1] * Step*Step*Step*Step*Step*Step / 6.0 + A6[1] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[1] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
        b[offset * 5 + index] = v0[2] + f0[2] * Step + A1[2] * Step*Step / 2 + A2[2] * Step*Step*Step / 3 + A3[2] * Step*Step*Step*Step / 4 + A4[2] * Step*Step*Step*Step*Step / 5.0 + A5[2] * Step*Step*Step*Step*Step*Step / 6.0 + A6[2] * Step*Step*Step*Step*Step*Step*Step / 7.0 + A7[2] * Step*Step*Step*Step*Step*Step*Step*Step / 8.0;
    }
}

This is numerical integration by the Everhart method, where centr_body is a device function. I use a GTX 960 with 3 GB.

A few more things. Putting those variables into __shared__ like you did will break as soon as there is more than one thread in a thread block, unless you also introduce *thread-specific* indexing into each shared memory variable, as sketched below. It is highly likely you will run out of shared memory before you can reach a reasonable number of threads in a block with this method.
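To illustrate, thread-specific indexing would look something like this (a sketch; the block size of 64 and the macro name are purely illustrative):

    #define THREADS_PER_BLOCK 64
    __shared__ cu_real r0[THREADS_PER_BLOCK][3];   // one 3-vector per thread in the block
    r0[threadIdx.x][0] = a[offset * 0 + index];    // each thread touches only its own slot
    r0[threadIdx.x][1] = a[offset * 1 + index];
    r0[threadIdx.x][2] = a[offset * 2 + index];

The same extra dimension would be needed for every one of the __shared__ arrays, which is why shared memory runs out quickly.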

Defining your arrays with local scope (like in the commented out version) and indexing them in a for loop like this...

...forces all these arrays to reside in local memory. And that is going to be painfully slow. A #pragma unroll statement in front of such loops will allow the arrays to live in registers instead, because the index variable becomes known at compile time.
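For the initialization loop in the posted kernel (using the thread-local array definitions, not the __shared__ ones), that would look like:

    #pragma unroll                  // trip count is a compile-time constant: the loop
    for (int i = 0; i < 3; i++)     // fully unrolls, every index becomes a constant,
    {                               // and the arrays can be mapped to registers
        A1[i] = 0;
        A2[i] = 0;
        // ... likewise for the remaining per-axis arrays
    }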

My recommendation is: forget about the __shared__ memory, unroll all your indexing loops.

Is this some sort of machine-generated code? I would suggest arranging the polynomial evaluations according to the Horner scheme. I am not familiar with this (complex) Everhart method of integration; are you sure it is superior to simpler integration algorithms?

Everhart, Edgar. "Implicit single-sequence methods for integrating orbits." Celestial Mechanics and Dynamical Astronomy 10.1 (1974): 35-55.

Good general advice, but this would tend to increase register pressure, which the OP claims is already a problem.

I think (haven't checked) that writing the polynomials as chains of serial multiplies creates many temporary variables through common-subexpression elimination (CSE), which is why nested multiplication (i.e., the Horner scheme) should be advantageous; it is likely to also improve accuracy. E.g.:

(A7[j]/72.0 * t_ever[i] + A6[j]/56.0) * t_ever[i] + ...
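Applied to the full position polynomial from the posted code, the nested form would be (a sketch, with t_ever[i] hoisted into a local variable t):

    cu_real t = t_ever[i];
    r_i[j] = r0[j] + t * (v0[j] + t * (f0[j] / 2.0 + t * (A1[j] / 6.0 +
             t * (A2[j] / 12.0 + t * (A3[j] / 20.0 + t * (A4[j] / 30.0 +
             t * (A5[j] / 42.0 + t * (A6[j] / 56.0 + t * (A7[j] / 72.0)))))))));

If the constant divisions are additionally replaced by multiplications with precomputed reciprocals, as discussed further down, each nesting level becomes a single FMA.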

Seeing screenshots of the profiling results would be helpful.

My reading on Everhart integration suggests use of double-precision arithmetic (i.e. cu_real = double), since the algorithm seems to be targeted at cases where accuracy is already a problem. In that case, the use of double precision on a consumer GPU would be the biggest issue.

The code as posted doesn't compile as-is, so I hacked it up until it did. Using CUDA 8.0, with cu_real = float, compiled for sm_50, I see:

(1) Code as-is: 100 registers

(2) as (1), but with floating-point literals as 'float' constants (with 'f' suffix): 92 registers

(3) as (2), plus __restrict__ on kernel pointer arguments: 87 registers

(4) as (3), but with -prec-div=false: 80 registers

(5) as (3), plus 'unroll 3' on innermost loops: 120 registers

(6) as (3), plus unrolling of all fixed-iteration loops: 104 registers

The actual register usage is likely a bit higher due to omitted code. If cu_real = double, it would be higher still.
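For anyone reproducing these numbers: per-kernel register counts are what ptxas reports when compiling with verbose output, e.g. (file name hypothetical):

    nvcc -arch=sm_50 -Xptxas -v everhart.cu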

Use nested multiplication for all polynomials to reduce the number of double-precision operations needed. With nested multiplication (Horner scheme), FMAs will be used instead of separate multiplies and adds. Also consider explicitly converting the divisions in these into multiplications with the reciprocals, as this will save a lot of double-precision operations. To keep the computation flexible with regard to type, I would suggest pulling out the weights into separate 'const' variables. E.g.:
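    // Sketch (velocity polynomial): weights pulled out as 'const' reciprocals, so the
    // divisions by constants become multiplies, and the body in nested (Horner) form
    const cu_real c2 = 1.0 / 2.0, c3 = 1.0 / 3.0, c4 = 1.0 / 4.0, c5 = 1.0 / 5.0;
    const cu_real c6 = 1.0 / 6.0, c7 = 1.0 / 7.0, c8 = 1.0 / 8.0;
    cu_real t = t_ever[i];
    v_i[j] = v0[j] + t * (f0[j] + t * (c2 * A1[j] + t * (c3 * A2[j] + t * (c4 * A3[j] +
             t * (c5 * A4[j] + t * (c6 * A5[j] + t * (c7 * A6[j] + t * (c8 * A7[j]))))))));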

By all means, try adding 'const' and '__restrict__' attributes to the pointer arguments of the kernel as appropriate. This gives the compiler greater flexibility in scheduling loads. For cu_real = double, I see register usage at 160 for the code as-is, and 152 when __restrict__ pointers are used. When the polynomials are also changed in the way shown above, register usage drops to 146.
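Applied to the posted kernel, the signature would become something like this (a sketch; 'a' is only read and 'b' only written in the code shown):

    __global__ void Everhart_device_centr_body(const int * __restrict__ id_ko_array,
                                               const cu_real * __restrict__ a,
                                               cu_real * __restrict__ b,
                                               cu_real t_tek, const int offset, cu_real Step)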

P.S. The problem I am trying to solve needs double precision because of the required calculation accuracy. I wrote the code when I was using a Tesla K40 for the computations. Now I no longer have that powerful device and am trying to use what I have: a GTX 960.

The profiler shows 173 registers without shared memory and 114 with it. Occupancy is 12.3% with registers only and 25% with shared memory. Occupancy is closely tied to the number of registers per thread. Interestingly, the computation with higher occupancy (the shared-memory version) is slower.

With regard to register pressure, I am not sure this is the biggest performance bottleneck. I still think you would want to minimize the number of double-precision operations first. Are you using CUDA 8? Different compiler versions may be tuned differently with regard to aggressive scheduling of loads, for example, and this has an impact on register pressure.

Occupancy is only loosely coupled with performance. See Vasily Volkov's "Better Performance at Lower Occupancy": http://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf. Shared memory access is slower than register access.