Global memory vs register storage: how do I force the compiler to use registers?

In a device function, I am allocating some local storage.

struct mystruct
{
   ushort2 xy;
   float2 st;
};

__device__ void foo()
{
   mystruct local[10];
   ...
}

When I compile using nvcc 2.2, the storage is clearly allocated in local memory, which lives in off-chip device memory. Using the compiler option -Xptxas=-v I can see the lmem figure vary as I change the size of the array.

But I don’t want to use off-chip device memory, because it is too slow for frequent access. Instead I want to use registers. My target device is a Tesla S1070, and I believe I have registers to spare.

How can I encourage the compiler to use registers rather than slow device memory for this array?

Changing data type and alignment does not seem to help.

The general problem I am having is that each thread needs one or two hundred bytes of read-write memory. Shared memory is too small and global memory is too slow. I imagine this must be a common problem: what are the standard solutions, please?

Note: I have been blocked on this issue for some time and would really appreciate some help.

Thanks!

I don't know of a way to control register allocation, or to force the compiler to use registers rather than local memory.

The compiler decides by itself which memory to use. In your case the struct array takes a lot of memory, so I think the compiler decided to allocate your array in local memory.

The number of registers is limited: 8192 for all the threads in a block.

For instance, if you have 256 threads per block, each thread only has 32 registers to use.

Consider your case, assuming 256 threads per block: sizeof(mystruct) * 10 is at least 120 bytes, i.e. around 30 registers per thread, so how could the kernel do anything else with only a couple of registers left?
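(For reference, here is a quick host-side check of that arithmetic, using the mystruct definition from the first post. Note that float2 is declared __align__(8) in CUDA's vector_types.h, so the struct is actually padded to 16 bytes, i.e. 40 registers' worth for ten elements. Compile with nvcc so vector_types.h is on the include path.)

#include <cstdio>
#include <vector_types.h>   // CUDA's ushort2 / float2 definitions

struct mystruct
{
   ushort2 xy;   // 4 bytes
   float2 st;    // 8 bytes, 8-byte aligned, so 4 padding bytes precede it
};

int main()
{
   printf("sizeof(mystruct)    = %d bytes\n", (int)sizeof(mystruct));
   printf("mystruct local[10]  = %d bytes\n", (int)(10 * sizeof(mystruct)));
   printf("as 32-bit registers = %d per thread\n",
          (int)(10 * sizeof(mystruct) / 4));
   return 0;
}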

Even if I reduce the size of the array, as in…

__device__ void foo()
{
   mystruct local[1];
}

… the array is still placed in local memory. Also, reducing the number of threads per block has no effect.

I need a way to tell the compiler to use registers. The ‘register’ keyword is ignored.

__device__ void foo()
{
   mystruct local[1];
}

I am not sure the compiler will use registers to allocate your "local" array,

but if you change it like this, I think the compiler will use registers for the xy and tp variables:

__device__ void foo()
{
   ushort2 xy;
   float2 tp;
}

Of course, the two methods above are not really the same.

I have recently read this explanation:

"The compiler cannot determine if an array is indexed with constant quantities. Please note that registers are not addressable so an array has to go into local memory – even if it is a two-element array – when the addressing of the array is not known at compile time. "

From:

http://www.ddj.com/hpc-high-performance-computing/215900921

Actually it sounds reasonable.
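A hypothetical illustration of that rule: compile with -Xptxas=-v and compare the lmem figures for these two kernels.

// Index unknown at compile time: 'a' must be addressable,
// so ptxas places it in local memory (lmem > 0).
__global__ void dynamic_index(const int *idx, float *out)
{
   float a[4] = {0.f, 1.f, 2.f, 3.f};
   out[0] = a[*idx];
}

// Every index known at compile time: 'a' can live in registers (lmem = 0).
__global__ void constant_index(float *out)
{
   float a[4] = {0.f, 1.f, 2.f, 3.f};
   out[0] = a[2];
}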

So you need to declare

mystruct local1;
mystruct local2;
mystruct local3;

In this way they should go to registers. Alternatively, I think you need to perform all the addressing of the array at compile time, e.g. using templates. In such a case the compiler MAY be able to optimize the array into registers.
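For what it's worth, here is a minimal sketch of that idea (hypothetical code, using #pragma unroll rather than explicit template metaprogramming): with a template parameter as the trip count, the loops are fully unrolled, every array index becomes a compile-time constant, and ptxas is then free to keep the array in registers.

struct mystruct { ushort2 xy; float2 st; };   // from the first post

template <int N>
__device__ float weighted_sum(const float *in, float w)
{
   mystruct local[N];

   #pragma unroll   // fully unrolled: every 'i' below is a constant
   for (int i = 0; i < N; ++i) {
      local[i].st.x = in[i] * w;
      local[i].st.y = in[i] + w;
   }

   float s = 0.0f;
   #pragma unroll
   for (int i = 0; i < N; ++i)
      s += local[i].st.x + local[i].st.y;
   return s;
}

Called as weighted_sum<10>(ptr, w). Whether ptxas actually keeps all the elements in registers still depends on overall register pressure, so check the -Xptxas=-v output.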

I am also trying to force my kernel to use registers rather than shared memory (if it isn't doing so already). I looked for code or directives that specifically request registers instead of shared memory, but it seems this is not something the programmer can specify; nvcc decides automatically.

When I run deviceQuery, it shows:

Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       16384 bytes
Total number of registers available per block: 16384

It seems that I have double the registers you do… if we are running the same kernel, would nvcc make use of all 16384, or still just 8192?

When I compile the code with the -cubin option and examine the .cubin file, the output is:

architecture {sm_10}
abiversion {0}
modname {cubin}
consts {
	name = d_k_fir_fff_taps
	segname = const
	segnum = 0
	offset = 0
	bytes = 65536
}
code {
	name = __globfunc__Z40cuda_fir_fff_filter_7_kernel_short_decimPKfPf
	lmem = 0
	smem = 48
	reg = 11
	bar = 1
	bincode {

It shows only 11 registers. Why?

If I decrease the shared memory allocation, would more registers be used? Can anyone please shed some light on this?

Thanks

Cards with compute capability >= 1.2 have 16384 registers per multiprocessor; the older ones (1.0, 1.1) have 8192. (See Appendix A of the CUDA Programming Guide.)

Also, decreasing shared memory usage should not affect your register usage, but it may increase the occupancy of the multiprocessor if the number of thread blocks per SM was limited by shared memory in the first place.

The "11" is the number of registers per thread that nvcc is using.

(Check the CUDA occupancy calculator spreadsheet.)
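To make that concrete, here is a back-of-the-envelope version of what the spreadsheet computes. The inputs are assumptions: reg = 11 and smem = 48 come from the cubin above, but the 256-thread block is hypothetical, and real hardware rounds register allocation to warp granularity, so treat this as an approximation.

#include <cstdio>

int main()
{
   const int threadsPerBlock = 256;     // assumed launch configuration
   const int regsPerThread   = 11;      // from the cubin dump above
   const int smemPerBlock    = 48;      // from the cubin dump above

   const int regsPerSM       = 16384;   // compute 1.2/1.3 limits
   const int smemPerSM       = 16384;
   const int maxThreadsPerSM = 1024;
   const int maxBlocksPerSM  = 8;

   // How many blocks fit on one SM under each separate limit?
   int byRegs    = regsPerSM / (regsPerThread * threadsPerBlock);  // 5
   int bySmem    = smemPerSM / smemPerBlock;                       // 341
   int byThreads = maxThreadsPerSM / threadsPerBlock;              // 4

   int blocks = byRegs;
   if (bySmem < blocks)         blocks = bySmem;
   if (byThreads < blocks)      blocks = byThreads;
   if (maxBlocksPerSM < blocks) blocks = maxBlocksPerSM;

   printf("resident blocks per SM: %d -> occupancy %d%%\n",
          blocks, 100 * blocks * threadsPerBlock / maxThreadsPerSM);
   return 0;
}

With these numbers the limiting factor is the thread count per SM, not the 11 registers or the 48 bytes of shared memory, so shrinking shared memory further would change nothing.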
