shared memory vs registers
What would really increase your speed is switching to one of those Tesla cards, or the original Kepler-based GeForce Titan (GK110) or Titan Z (dual GPU), which featured very high double precision throughput at 1/3 of the single precision rate.

Christian

#16
Posted 07/27/2017 08:31 AM   
The amount of shared memory on a GPU is only about 1/4 of the register pool, so it can hardly help without sharing data between threads. Instead, you can try to offload constants to constant memory.
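
A minimal sketch of that idea, assuming a hypothetical coefficient table d_coeff (the name, size, and computation are placeholders for whatever per-model constants every thread reads):

[code]
// Hypothetical coefficient table in constant memory; every thread reads the
// same entries, so the constant cache can broadcast them across the warp.
__constant__ double d_coeff[64];

__global__ void propagate(const double *state_in, double *state_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double acc = 0.0;
    // The index k is uniform across the warp in each iteration, which is the
    // access pattern constant memory is designed for: one broadcast per read,
    // no serialization.
    for (int k = 0; k < 64; ++k)
        acc += d_coeff[k] * state_in[i];

    state_out[i] = acc;
}

// Host side, once before the kernel launch:
//   cudaMemcpyToSymbol(d_coeff, h_coeff, sizeof(h_coeff));
[/code]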

#17
Posted 07/27/2017 04:06 PM   
The idea of distributing the information across threads is actually a good one. I've used this strategy to accelerate some hash functions on consumer Kepler GPUs that had trouble keeping the full hash state in a single thread due to register pressure. So I spread the state across 4 consecutive threads and got a decent speedup.

The unfortunate part is that your arrays have dimensions of 3 and 7, which are not powers of 2, so a bit of padding (losing some efficiency) is required to make them compatible with the warp shuffle instructions.
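
For illustration only (not your code; the names and the placeholder computation are made up), here is roughly how a 7-element state could be padded to 8 and spread across groups of 8 consecutive lanes, using __shfl_sync to fetch another lane's element:

[code]
// Illustration: a 7-element state padded to 8 and distributed across groups
// of 8 consecutive lanes, one element per lane, each held in a register.
__global__ void distributed_state_demo(const double *in, double *out)
{
    const unsigned full_mask = 0xffffffffu;      // assumes the full warp is active
    int slot  = threadIdx.x & 7;                 // position 0..7 within the group
    int group = (blockIdx.x * blockDim.x + threadIdx.x) >> 3;  // one state per 8 lanes

    // Lanes 0..6 of each group hold real data; lane 7 is padding.
    double elem = (slot < 7) ? in[group * 7 + slot] : 0.0;

    // Every lane fetches element 3 of its own group. The width argument of 8
    // (a power of 2 - hence the padding) confines the shuffle to the group.
    double e3 = __shfl_sync(full_mask, elem, 3, 8);

    elem += 0.5 * e3;                            // placeholder computation

    if (slot < 7)
        out[group * 7 + slot] = elem;
}
[/code]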

If you could post your latest code I could have a closer look.

#18
Posted 07/27/2017 04:32 PM   
[quote]instead, you can try to offload constants to the constant memory[/quote]

The constant memory is 64K, so not significantly larger than shared memory. And since it is designed for broadcast access across an entire warp, efficient use requires uniform access (or at most 2-3 different addresses across a warp, i.e. minor serialization), which basically amounts to inter-thread sharing.

The compiler moves compile-time constants (user specified as well as compiler generated ones) into constant memory on its own. So I am not sure what specific data on top of that we could / would manually move into constant memory in this code?

I note that register pressure is significantly lower (100 registers) with the code as originally written (all arrays in local memory) plus complete loop unrolling (as suggested by cbuchner in #4, so we have come full circle on this :-), instead of the incorrect version using arrays in shared memory that was posted. So that would seem the way to go.
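
To make the contrast concrete, a sketch of that pattern (placeholder derivative, not the actual kernel): with the trip count known at compile time and the loop fully unrolled, every array index becomes a compile-time constant, so the compiler can keep the small local arrays entirely in registers.

[code]
// Sketch: small fixed-size local arrays plus full unrolling. After unrolling,
// every k1[i] / y[i] access has a constant index, so nothing needs to spill
// to local memory and the arrays can live in registers.
__device__ void rk_step_sketch(double y[7], double h)
{
    double k1[7];

    #pragma unroll
    for (int i = 0; i < 7; ++i)
        k1[i] = -0.1 * y[i];     // placeholder derivative (stands in for cent_body etc.)

    #pragma unroll
    for (int i = 0; i < 7; ++i)
        y[i] += h * k1[i];
}
[/code]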

#19
Posted 07/27/2017 04:46 PM   
I think about constant memory because the real version of the code contains not such a simple device function as cent_body but more complicated ones (accelerations from the Earth's gravitational field with n x n harmonics, accelerations from atmospheric drag, etc.). So additional arrays of constant values appear in the code. Each thread uses these arrays, and it would be better to allocate them as constant variables in constant memory. The idea of cbuchner1 to combine threads for the calculation is interesting too, but it requires completely changing the code.
Thanks

#20
Posted 07/28/2017 06:17 AM   