CUDA 4.1 vs. 3.2

Hi All -

I am having a problem with programs I wrote using CUDA 3.2. The programs are designed so that small per-thread local arrays are mapped to registers by indexing them only with compile-time constant addresses. This works great when I compile with CUDA 3.2; however, when I compile with the “improved” 4.1 compiler, all of my registers are spilling into local memory. Has anyone else had this issue? Does anyone have any suggestions as to what might be causing this? I had heard that 4.1 was much better at register allocation, but clearly I am not seeing it.
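
To make the pattern concrete, here is a minimal sketch of the kind of kernel described above. It is not the actual code from my programs: the kernel name `scaleAndSum`, the array length `N`, and the launch configuration are all hypothetical, chosen only for illustration. The idea is that once both loops are fully unrolled, every index into `v` is a compile-time constant, so the compiler is free to keep each element in a register. A kernel this small may well stay in registers under either toolkit, so it shows the pattern rather than reproducing the spill.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 8  // hypothetical per-thread array length (assumption for this sketch)

// Hypothetical kernel: each thread owns a small local array. After full
// unrolling, all indices into v[] are compile-time constants, which is the
// condition that lets the compiler place the elements in registers.
__global__ void scaleAndSum(const float *in, float *out, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;

    float v[N];

    // Fill the local array; the index i is constant in each unrolled copy.
    #pragma unroll
    for (int i = 0; i < N; ++i)
        v[i] = in[tid * N + i] * 2.0f;

    // Reduce the local array, again with constant indices after unrolling.
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i)
        sum += v[i];

    out[tid] = sum;
}

int main()
{
    const int numThreads = 256;
    const size_t inBytes  = numThreads * N * sizeof(float);
    const size_t outBytes = numThreads * sizeof(float);

    // Host buffers: every input element is 1.0f, so each thread should
    // produce N * 2.0f.
    float *hIn  = (float *)malloc(inBytes);
    float *hOut = (float *)malloc(outBytes);
    for (int i = 0; i < numThreads * N; ++i) hIn[i] = 1.0f;

    float *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, inBytes);
    cudaMalloc((void **)&dOut, outBytes);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 128;
    const int blocks = (numThreads + threadsPerBlock - 1) / threadsPerBlock;
    scaleAndSum<<<blocks, threadsPerBlock>>>(dIn, dOut, numThreads);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify every thread's result on the host.
    int ok = 1;
    for (int i = 0; i < numThreads; ++i)
        if (hOut[i] != 2.0f * N) ok = 0;
    printf("%s\n", ok ? "OK" : "MISMATCH");

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return ok ? 0 : 1;
}
```

Compiling with `nvcc -Xptxas -v` makes ptxas print the per-kernel register count and any local memory (lmem) usage, which is one way to compare how the two toolkit versions treat the same source.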

Thanks