Kernels with heavy register spills - tricks to improve memory throughput?
Hi,

GPU Tesla 2070

I have a kernel which is quite big and results in a lot of register spills -
[b]568 bytes stack frame, 1028 bytes spill stores, 1116 bytes spill loads[/b]

This results in many local loads and stores - and also in a heavy dependence on the compiler's ability to make these loads and stores coalesced.
Since I don't know enough about the nvcc compiler, I really don't know whether it is doing a good job of managing these spilled loads and stores (which now go to global memory).
The application is such that the major part of the global memory transactions comes from these spilled loads/stores, so managing them effectively becomes very important - in my opinion.

Requested Global Load Throughput(bytes/sec) -- 385659128
Global Load Throughput(bytes/sec) -- 12584594915
Requested Global Store Throughput(bytes/sec) -- 296805
Global Store Throughput(bytes/sec) -- 732814299
Global load efficiency -- about 3%
Global store efficiency -- 0.04%

As we can see, I currently achieve extremely low global load/store efficiency, which indicates extremely wasteful utilization of bandwidth - am I correct?

What are the possible ways to improve this situation?
- Coding methods?
- Shifting some variables into shared memory (how - based on their longer live ranges, etc.?)
- Explicitly declaring some internal variables as global so that I can manage their coalesced access myself? (See the sketch below.)
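For the last option, here is a minimal sketch of what I have in mind (the names and the 4-element per-thread scratch area are made up purely for illustration): a per-thread work array kept in a global buffer, strided by the total thread count so that the accesses of a warp stay on consecutive addresses.

[code]
__global__ void kernel(const double *in, double *out, double *scratch, int n)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x  * blockDim.x;

    if (tid < n) {
        // Element i of this thread's private scratch area lives at
        // scratch[i * nthreads + tid], so at any instant the threads of a
        // warp touch consecutive addresses (coalesced).
        for (int i = 0; i < 4; ++i)
            scratch[i * nthreads + tid] = in[tid] * (i + 1);   // coalesced stores

        double acc = 0.0;
        for (int i = 0; i < 4; ++i)
            acc += scratch[i * nthreads + tid];                // coalesced loads

        out[tid] = acc;
    }
}
// scratch must be allocated with 4 * nthreads * sizeof(double) bytes.
[/code]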

I would like to know your thoughts on these.

Thanks!
Sid.

#1
Posted 04/23/2012 09:05 AM   
Are you sure your reads and writes are coalesced?
Do the 32 threads of each warp request contiguous locations in memory?
As a general rule, your reads and writes should be of the form (base_address + tid) to be coalesced.

I don't think it is a compiler problem.
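A toy sketch of the difference (hypothetical kernels, not taken from your code):

[code]
__global__ void copy_coalesced(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];     // each warp touches 32 consecutive words: coalesced
}

__global__ void copy_strided(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid * stride] = in[tid * stride];   // warp scattered across memory: uncoalesced for stride > 1
}
[/code]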

#2
Posted 04/23/2012 09:12 AM   
[quote name='apostglen46' date='23 April 2012 - 03:42 PM' timestamp='1335172365' post='1399746']
Are you sure your reads and writes are coalesced?
Do the 32 threads of each warp request contiguous locations in memory?
As a general rule, your reads and writes should be of the form (base_address + tid) to be coalesced.

I don't think it is a compiler problem.
[/quote]

As I said, the major portion of my global loads/stores comes disguised as 'local' accesses. That means the spilled registers and some other data structures have automatically been placed in global memory by the compiler, and (in my current implementation) I have no control over them - I cannot apply coalesced read/write rules like the one you mentioned to those local loads/stores; that is probably the compiler's job.

Correct me if I am wrong in my understanding.

#3
Posted 04/23/2012 09:19 AM   
Local memory is laid out differently so accesses are always coalesced (unless threads access "automatic" arrays each using a different index).

#4
Posted 04/23/2012 09:43 AM   
[quote name='tera' date='23 April 2012 - 04:13 PM' timestamp='1335174209' post='1399756']
Local memory is laid out differently so accesses are always coalesced (unless threads access "automatic" arrays each using a different index).
[/quote]

So suppose I have two arrays declared in my kernel.

double X[4], Y[4];

Variable X is used in a loop which has been completely unrolled, so all of its indices are fixed at compile time.
Variable Y is used in a complex (maybe nested) loop and has not been unrolled.

a) Would it mean that the compiler will attempt to allocate registers for variable X, which may or may not spill to local memory, AND that its accesses will be coalesced even when it spills?

b) For variable Y, would it be a case of an "automatic" indexed array, going directly to local memory and probably not coalesced?

Am I understanding it correctly?

Are we sure that in general local loads/stores are coalesced (I would assume so - the compiler should be able to do this)? Do you happen to remember where this has been stated by Nvidia?

So I should probably be focusing on variables like Y. How do I identify them with certainty?
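To make the two cases concrete, here is roughly what I mean (a toy sketch; idx[] just stands for any index the compiler cannot resolve at compile time):

[code]
__global__ void example(const double *in, double *out, const int *idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double X[4], Y[4];

    // X: fully unrolled, so every index is a compile-time constant and the
    // compiler can keep X in registers (spilling only if it runs out of them).
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        X[i] = in[tid] * (i + 1);

    // Y: the index is only known at run time, so Y needs addressable (local
    // memory) storage, producing local loads/stores.
    for (int i = 0; i < 4; ++i)
        Y[idx[i] & 3] = X[i];

    double acc = 0.0;
    for (int i = 0; i < 4; ++i)
        acc += Y[i];
    out[tid] = acc;
}
[/code]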

#5
Posted 04/23/2012 10:05 AM   
[quote name='sidxavier' date='23 April 2012 - 11:05 AM' timestamp='1335175512' post='1399762']
Am I understanding it correctly?
[/quote]
Yes, I think so.

[quote name='sidxavier' date='23 April 2012 - 11:05 AM' timestamp='1335175512' post='1399762']
Do you happen to remember where this has been stated by Nvidia?
[/quote]
It is stated in the Programming Guide:
[quote name='Nvidia CUDA C Programming Guide version 4.2, Section 5.3.2.2']
Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g. same index in an array variable, same member in a structure variable).
[/quote]

[quote name='sidxavier' date='23 April 2012 - 11:05 AM' timestamp='1335175512' post='1399762']
So I should probably be focusing on variables like Y. How do I identify them with certainty?
[/quote]
The only way to be sure (I think) is to look at disassembled device code using cuobjdump -sass. But your description makes it pretty clear that the compiler has no better option for this variable. You might however manually place it in shared memory (with suitable threadIdx-dependent addressing).
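A minimal sketch of that shared-memory idea (the kernel and the 4-element array are made up, and bank conflicts are ignored here):

[code]
__global__ void kernel(const double *in, double *out)
{
    // One 4-element "Y" per thread of the block, held in shared memory instead
    // of local memory and addressed with a threadIdx-dependent stride.
    extern __shared__ double Y[];        // 4 * blockDim.x doubles per block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 4; ++i)
        Y[i * blockDim.x + threadIdx.x] = in[tid] + i;

    double acc = 0.0;
    for (int i = 0; i < 4; ++i)
        acc += Y[i * blockDim.x + threadIdx.x];

    out[tid] = acc;
}

// launched with the dynamic shared memory size as the third parameter:
// kernel<<<grid, block, 4 * block.x * sizeof(double)>>>(d_in, d_out);
[/code]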

#6
Posted 04/23/2012 10:22 AM   
Thanks Tera. This was very helpful.
I will analyse my kernel and the various profiler stats further and get back here.

Sid.

#7
Posted 04/23/2012 10:29 AM   
I think the best thing is for you to declare your arrays in shared memory and have full control over them.

#8
Posted 04/23/2012 02:23 PM   
Or just configure the device for more L1 cache and less shared memory.
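That is a one-line change in the host code; a sketch, with myKernel standing in for your real kernel:

[code]
#include <cstdio>

__global__ void myKernel(float *data)    // placeholder for the real kernel
{
    data[threadIdx.x] *= 2.0f;
}

int main()
{
    // Prefer the 48 KB L1 / 16 KB shared-memory split for this kernel
    // (the Fermi default is 16 KB L1 / 48 KB shared memory).
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));

    // cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) would set the same
    // preference device-wide.

    // ... allocate memory and launch myKernel as usual ...
    return 0;
}
[/code]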

#9
Posted 04/23/2012 03:55 PM   
[quote name='tera' date='23 April 2012 - 10:25 PM' timestamp='1335196550' post='1399892']
Or just configure the device for more L1 cache and less shared memory.
[/quote]

I already have this in my implementation.
My L1 cache hit rates:
L1 local hit rate - 86%
L1 global hit rate - 72%

[quote]
I think the best thing is for you to declare your arrays in shared memory and have full control over them.
[/quote]

It should be helpful. I will try this and report back if I get an improvement.

Sid.

PS - Quick question: is there an obvious reason for a kernel configuration with a bigger block size to run faster than one with a smaller block size, even when the grid sizes have been rearranged so that roughly the same number of warps is launched on each SM in both cases? E.g. (blockSize x gridSize) 64 x 96 is faster than 32 x 192? There is no intra-block communication etc.

#10
Posted 04/23/2012 07:45 PM   
Yes: even numbers of warps are slightly faster for a number of reasons, not all of them documented. And even 64 threads per block don't allow you to reach full occupancy, due to the limit on the number of resident blocks per SM.

OTOH you want to allow for more than one block per SM, and occupancy is not equivalent to speed either.
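Rough arithmetic for compute capability 2.0 (at most 48 resident warps and 8 resident blocks per SM), ignoring the register and shared memory limits that will cap a heavily spilling kernel even further:

32 threads/block: at most 8 blocks x 1 warp = 8 warps per SM, i.e. 8/48, about 17% occupancy
64 threads/block: at most 8 blocks x 2 warps = 16 warps per SM, i.e. 16/48, about 33% occupancy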

#11
Posted 04/24/2012 10:09 AM   
[quote name='tera' date='24 April 2012 - 04:39 PM' timestamp='1335262190' post='1400244']
Yes: even numbers of warps are slightly faster for a number of reasons, not all of them documented. And even 64 threads per block don't allow you to reach full occupancy, due to the limit on the number of resident blocks per SM.

OTOH you want to allow for more than one block per SM, and occupancy is not equivalent to speed either.
[/quote]

Do you happen to know the limit on the maximum number of resident blocks for a Tesla 2070 (448 cores, so probably 14 SMs)?

#12
Posted 04/24/2012 10:25 AM   
Yes: It's 8 blocks per SM for all but the newest Kepler devices. Check Appendix F of the Programming Guide.

#13
Posted 04/24/2012 10:47 AM   