This indicates that local memory is being used for something, although I don’t know exactly what. I suspect that one of my most frequently used variables is being placed in lmem, because writing to it is responsible for 37% of my application runtime. Is there any way to force the compiler to place a variable in register memory and allow other things to get dumped to lmem instead?
Edit:
corrected a mistake in the reported % of app runtime.
Edit:
typo correction
There is no such way as far as I know. You can inspect the compiler output to check which variables were placed in local memory; for example, compiling with `--ptxas-options=-v` makes ptxas report the lmem, smem and register usage of each kernel. Btw, it looks like you are limited by shared memory size to only 1 block per SM on GT200; if your block size is not large, you can afford to use more registers. You could also try using shared memory to hold temporary variables, since about 5 KB of it is going unused anyway.
You might place variables in shared memory to free registers without spilling to local memory.
Preferably in such a way that no two operands of any instruction come from shared memory, as that would apparently require moving one of them through a register again.
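For illustration, a per-thread temporary can be parked in a shared memory array indexed by thread ID (a hypothetical sketch; BLOCK_SIZE and the kernel body are made up):

```cuda
#define BLOCK_SIZE 256

__global__ void kernel(float *out, const float *in)
{
    // One slot per thread: a temporary that would otherwise
    // occupy a register (or spill to lmem) now lives in smem.
    __shared__ float tmp[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tmp[threadIdx.x] = in[i] * in[i];   // write the temporary

    // Combine it with a register operand, so no instruction
    // needs two shared-memory operands at once.
    float r = in[i] + 1.0f;
    out[i] = tmp[threadIdx.x] * r;
}
```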
Can you suggest a program that would be able to inspect the code and tell me which variables are being stored in local memory? This would help me gain confidence that my diagnosis is correct.
Yes, I think that 1 active block per SM can possibly be more of a bottleneck than spilling into lmem (unless this is a 1.4 device). So if you could decrease both lmem and smem usage, that would be awesome :)
In some cases you can reduce your register use by thinking like a register allocator, i.e. be aware of the lifetime and scope of your variables, and consider moving variables into a narrower scope (e.g. recompute an index variable instead of keeping it at top-level scope). This may be faster than letting the compiler spill some random other variable.
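As a hypothetical sketch of that idea (kernel and names are made up), recomputing an index inside each block instead of keeping it alive across the whole kernel shortens its live range:

```cuda
__global__ void kernel(float *a, const float *b, int pitch)
{
    // Wide scope: "int idx = blockIdx.x * pitch + threadIdx.x;"
    // declared here would stay live across both phases and tie
    // up a register for the whole kernel.

    {   // Narrow scope: recompute idx where it is needed, so its
        // register can be reused between the two phases.
        int idx = blockIdx.x * pitch + threadIdx.x;
        a[idx] = b[idx] * 2.0f;
    }

    __syncthreads();

    {   // Recomputing is one multiply-add -- usually cheaper than
        // letting the compiler spill some other variable to lmem.
        int idx = blockIdx.x * pitch + threadIdx.x;
        a[idx] += b[idx];
    }
}
```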
I recently got down from 24 registers to 14 by doing some smart thinking (I had a hard limit of 16 registers because I need 480 threads or so on Compute 1.1 devices).
As a first approach I implemented a for loop to sum up three contributions to the final solution, which unfortunately exceeded my register limit.
In my case it turned out that I didn’t really need temporary variables to accumulate the contributions in registers (shared memory wouldn’t have been an option either, as I would have needed >16 KB). But summing the contributions in global memory was indeed an option for me. And I didn’t need a loop at all: I was able to unroll the entire computation. I placed most variables in a very narrow scope and ended up with 14 registers used.
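In sketch form (all names are hypothetical; `contribution` stands in for whatever the real computation was), accumulating directly into global memory without a register temporary or loop counter looks like:

```cuda
__device__ float contribution(const float *in, int i, int term);

__global__ void accumulate(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // No register accumulator and no loop counter: each of the
    // three contributions is added straight into global memory,
    // fully unrolled.
    out[i]  = contribution(in, i, 0);
    out[i] += contribution(in, i, 1);
    out[i] += contribution(in, i, 2);
}
```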
In your case, I propose to investigate first whether the “volatile trick” can bring any gains. Declare some index and loop variables volatile to see whether you save some lmem or registers in the process.
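The “volatile trick” in sketch form (hypothetical kernel): marking a small index variable volatile keeps the compiler from folding it into several longer-lived derived values, which on the old sm_1x toolchains can sometimes lower register pressure:

```cuda
__global__ void kernel(float *a, const float *b)
{
    // volatile forces base to be computed once and reread,
    // instead of being propagated into every use -- on compute
    // 1.x toolchains this sometimes saves registers or lmem.
    volatile int base = blockIdx.x * blockDim.x;
    int i = base + threadIdx.x;
    a[i] = b[i] * 2.0f;
}
```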
Thank you all for your replies. I was able to get everything into register memory. Variables are no longer spilling into local memory. Performance has improved, but not as much as I hoped it would.
I do not understand this comment. What do you mean by “one active block per SM”? Can someone please elaborate on why my shared memory usage is slowing down my program? I am pretty much just using shared memory as a cache, so it won’t be difficult for me to reduce the amount that I am using. But why would I want to reduce the size of this cache?
Since my device contains 16384 bytes of shared memory per multiprocessor, my understanding of your advice is that cutting my usage below 8 KB per block should permit 2 active blocks per multiprocessor. However, the change resulted in no improvement in execution time. Decreasing the cache size should not have impacted performance significantly. Did I understand the advice correctly?
This means you still run only one block per SM (and, at this block size, have little chance to change that unless you could get the register count down to 16), so there would be no improvement, and you might as well increase your shared memory use to close to 16 KB.
It might be beneficial to reduce the block size to 256 threads to run 2 blocks per SM. Whether or not this helps would very much depend on the memory access patterns, so you just have to try.
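The arithmetic behind this suggestion, assuming GT200-class limits of 16384 registers and 16 KB of shared memory per SM (the per-thread register counts below are just placeholders):

```cuda
// GT200-class SM (compute 1.3): 16384 registers, 16384 B smem.
//
// 512 threads/block at 32 regs/thread:
//   512 * 32 = 16384 regs  -> only 1 block fits per SM
// 256 threads/block at 32 regs/thread:
//   256 * 32 = 8192 regs   -> 2 blocks fit, register-wise,
//   as long as each block also stays under 8192 B of smem.
```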
Other than that, I guess we cannot help you with optimization unless you give more insight into what you are doing or post real code.
Thank you tera. This was very good advice. I lowered my block size to 128 and got the register usage under 64 (59 actually). This resulted in an 11% improvement in performance. I would also like to thank all the other posters in this thread. Following the advice and suggestions provided resulted in a 27% improvement in performance. I am creating a new thread about the CUDA profiler tool. Please drop by if you get a chance. - Bill