CUDA 4.1 vs. 3.2 register allocation...
Hi All -

I am having a problem with programs I wrote using CUDA 3.2. The programs are designed to map local arrays to register memory using constant addressing. This worked great when compiled with CUDA 3.2; however, now when I compile with the "improved" 4.1 compiler, all of my registers spill into local memory. Has anyone else had this issue? Does anyone have any suggestions as to what might be causing this? I had heard 4.1 was much better about register allocation, but clearly I am not seeing it.

Thanks
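
To illustrate what I mean (a hypothetical sketch, not my actual code): the pattern is a small thread-local array where every access uses a compile-time constant index, e.g. via a fully unrolled loop, so the compiler can promote the whole array to registers instead of placing it in local memory:

```cuda
// Hypothetical sketch: a 4-tap filter whose local array is only ever
// indexed with compile-time constants. After full unrolling, every
// taps[i] access is constant-addressed, so the compiler can keep the
// array entirely in registers rather than spilling it to local memory.
__global__ void fir4(const float *in, float *out, int n)
{
    float taps[4];                  // should map to 4 registers, not lmem

    #pragma unroll
    for (int i = 0; i < 4; ++i)     // fully unrolled -> constant indices
        taps[i] = 0.25f;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid + 3 < n) {
        float acc = 0.0f;
        #pragma unroll
        for (int i = 0; i < 4; ++i) // likewise constant-addressed
            acc += taps[i] * in[tid + i];
        out[tid] = acc;
    }
}
```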

#1
Posted 04/20/2012 10:25 PM   
Hi,
For compute capability 2.0 onward, the 4.1 compiler uses LLVM for GPU code rather than Open64 as it did beforehand. AFAIK, this usually leads to better performance, but should you encounter a problem, you can always fall back to the Open64 compiler with the -open64 switch.

#2
Posted 04/21/2012 08:11 PM   
It is not clear which architecture your builds are targeting, but I assume it is compute capability 2.x?

Note that use of -open64 in conjunction with sm_2x targets is not being tested, and this undocumented flag could go away from one release to the next without going through staged deprecation. Does -Xptxas -v explicitly report spills? Not all local memory usage is due to register spills. For example, use of the volatile qualifier can also force variables into local memory. You would want to remove all uses of volatile not needed for functional correctness. I am mentioning this since some CUDA programmers discovered that with Open64, volatile could be exploited to influence register pressure, taking advantage of a compiler artifact that no longer exists in the NVVM frontend used for 2.x and 3.x targets.
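
As a sketch of that check (the command is real; the sample output lines are illustrative and their exact wording varies by toolkit version and target):

```shell
# Ask ptxas to report per-kernel resource usage, including local memory.
nvcc -arch=sm_20 -Xptxas -v -c kernel.cu

# For sm_2x targets, look for explicit spill reporting, e.g.:
#   ptxas info : Used 18 registers, 8 bytes spill stores, 8 bytes spill loads
# For sm_1x targets there is no spill line; nonzero "lmem" indicates
# local memory usage, which may or may not be spill-related:
#   ptxas info : Used 18 registers, 16 bytes lmem
```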

Compilers tend to incorporate many heuristics, and while a particular combination of those heuristics may give a boost to most codes, there will be some codes which do not benefit, and a third category will even see worse performance. If the slowdown you are seeing has a significant impact on the performance of your application, I would suggest filing a bug against the compiler, attaching a repro case and performance numbers showing the performance drop from CUDA 3.2 to CUDA 4.1. Thank you for your help, and sorry for the inconvenience.

#3
Posted 04/21/2012 10:10 PM   
I ran into the same problem with 4.1. I expected that the compiler would be able to properly expand constant addressing of an array into registers. It didn't and some effort was spent trying to find out why to no avail. I chalked it up to my particular snippet of code being outside the norm and moved on...

The work-around in my case was to descend into the baroque use of macros to expand my register-using code blocks. That has worked really well but I do not recommend this approach since source-level debugging is pretty much lost.
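
To sketch what I mean (hypothetical names, not my actual macros): the idea is to replace the small local array with individually named scalars via token pasting, so nothing can be addressed dynamically and nothing can spill:

```cuda
/* Hypothetical sketch of the macro workaround: expand a 4-element local
   array into four named scalar variables. Scalars cannot be indexed at
   runtime, so the compiler has no reason to demote them to local memory
   (short of genuine register pressure). */
#define DECLARE_REGS(v)  float v##0, v##1, v##2, v##3
#define SET_ALL(v, x)    do { v##0 = (x); v##1 = (x); \
                              v##2 = (x); v##3 = (x); } while (0)

__global__ void scaled_copy(const float *in, float *out)
{
    DECLARE_REGS(c);   // expands to: float c0, c1, c2, c3;
    SET_ALL(c, 0.25f); // assigns each scalar individually

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c0 * in[tid]; // plain scalar use, register-resident
}
```

The obvious cost, as noted above, is that the expanded code is painful to step through in a source-level debugger.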

You should write up a bug and get it into the CUDA team so they know this problem exists. You should also verify that it hasn't been fixed in 4.2.

#4
Posted 04/22/2012 04:31 AM   
Thanks for your responses. I am targeting an embedded system with a GE GRA111; it is basically a modified 240GT with compute capability 1.2.

I really hope I can figure out how to do this without resorting to macro expansions. I will try CUDA 4.2 and see if it clears the problem up.

Thanks!

#5
Posted 04/23/2012 05:36 PM   
I just tried the -open64 option and it didn't change the register allocation. The command-line help on my Linux installation of nvcc doesn't list the -open64 option; is this option supported in the Linux version of nvcc?

#6
Posted 04/23/2012 08:29 PM   
For compute capability 1.x, -open64 is the default. Try -nvvm instead to use the new LLVM-based compiler.

Both options are undocumented and thus not officially supported, but they seem to work well.
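
In other words, something like this (illustrative invocations; the flags are undocumented, so check whether your toolkit version still accepts them):

```shell
# Default for sm_1x targets: the Open64 frontend.
nvcc -arch=sm_12 -open64 -Xptxas -v -c kernel.cu

# Force the new LLVM-based frontend instead and compare the
# register/lmem numbers reported by -Xptxas -v.
nvcc -arch=sm_12 -nvvm -Xptxas -v -c kernel.cu
```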


#7
Posted 04/24/2012 10:29 AM   