Does CUDA 6.5 support the K80 and sm_37?

Hi, I compiled my code with both CUDA 6.5 and CUDA 7.0 and got different ptxas output. As you can see below, CUDA 6.5 reports some register spilling, while CUDA 7.0 reports none. I cannot compare the performance because CUDA 7.0 cannot be installed on the K80 machine (for now).

Just curious: is CUDA 7.0 required for the K80?

CUDA 6.5
ptxas info : Compiling entry function '_Z28updateXByBlock2pRegDsmemTileiPfPKiS1_f' for 'sm_37'
ptxas info : Function properties for _Z28updateXByBlock2pRegDsmemTileiPfPKiS1_f
    48 bytes stack frame, 80 bytes spill stores, 80 bytes spill loads
ptxas info : Used 128 registers, 12000 bytes smem, 360 bytes cmem[0], 8 bytes cmem[2], 1 textures

CUDA 7.0
ptxas info : Compiling entry function '_Z28updateXByBlock2pRegDsmemTileiPfPKiS1_f' for 'sm_37'
ptxas info : Function properties for _Z28updateXByBlock2pRegDsmemTileiPfPKiS1_f
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 12000 bytes smem, 360 bytes cmem[0], 8 bytes cmem[2], 1 textures

Thanks,
Wei

Given that the CUDA 6.5 toolchain apparently compiled the code for sm_37 successfully, why not simply run the resulting executable? As I recall, the release of the K80 predated the release of CUDA 7.0, so I think the K80 should work fine with CUDA 6.5. But I have not used a K80, so I have no first-hand experience.

Because of the register spilling, the performance is not what I had hoped for. My code is bound by registers and shared memory, but on the K80 I do not see any benefit from its doubled register file and shared memory.
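For reference, here is a rough sketch of the register-side arithmetic behind the "doubled register" point. It assumes the per-SM figures I remember from the CUDA programming guide (64K 32-bit registers on sm_35, 128K on sm_37, 2048 resident threads per SM on both), and the helper name regLimitedThreads is made up for illustration. It ignores allocation granularity, block size and the shared memory limit, so treat it as an upper bound only.

```cuda
#include <cstdio>

// Hypothetical helper: upper bound on resident threads per SM imposed by
// register usage alone. Ignores allocation granularity, block size and smem.
static int regLimitedThreads(int regsPerSM, int regsPerThread, int maxThreadsPerSM)
{
    int byRegs = regsPerSM / regsPerThread;
    return byRegs < maxThreadsPerSM ? byRegs : maxThreadsPerSM;
}

int main()
{
    const int regsPerThread   = 128;         // "Used 128 registers" in the ptxas output
    const int maxThreadsPerSM = 2048;        // assumed for both sm_35 and sm_37
    const int regsK40         = 64 * 1024;   // assumed register file per SM on sm_35
    const int regsK80         = 128 * 1024;  // assumed register file per SM on sm_37

    const int tK40 = regLimitedThreads(regsK40, regsPerThread, maxThreadsPerSM);
    const int tK80 = regLimitedThreads(regsK80, regsPerThread, maxThreadsPerSM);

    printf("K40: at most %d threads/SM (%.0f%% occupancy)\n",
           tK40, 100.0 * tK40 / maxThreadsPerSM);
    printf("K80: at most %d threads/SM (%.0f%% occupancy)\n",
           tK80, 100.0 * tK80 / maxThreadsPerSM);
    return 0;
}
```

Under these assumptions, registers alone would allow 512 threads per SM on the K40 and 1024 on the K80 at 128 registers per thread. Whether the kernel actually reaches that depends on the block size and on the 12000 bytes of shared memory per block.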

I will ask IT to install CUDA 7.0, try it, and report back. I just wonder whether other users have had a similar experience.

No benefit relative to what? A Tesla K40? What occupancy are you observing on the K40 and what on K80?
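If you do not have profiler numbers handy, something along the following lines should report the theoretical occupancy on each machine. It is only a sketch: placeholderKernel stands in for your real kernel, and the block size of 256 and the zero bytes of dynamic shared memory are assumptions you would replace with your actual launch configuration. cudaOccupancyMaxActiveBlocksPerMultiprocessor is part of the runtime API as of CUDA 6.5.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the real kernel here.
__global__ void placeholderKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Fraction of the SM's thread slots occupied by the resident blocks.
static float occupancyFraction(int blocksPerSM, int blockSize, int maxThreadsPerSM)
{
    return (float)(blocksPerSM * blockSize) / (float)maxThreadsPerSM;
}

int main()
{
    const int    blockSize = 256;  // assumed launch configuration
    const size_t dynSmem   = 0;    // assumed: no dynamically allocated shared memory

    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    // Resident blocks per SM, given the kernel's register and smem usage.
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, placeholderKernel, blockSize, dynSmem);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s (sm_%d%d): %d blocks/SM, theoretical occupancy %.0f%%\n",
           prop.name, prop.major, prop.minor, blocksPerSM,
           100.0f * occupancyFraction(blocksPerSM, blockSize,
                                      prop.maxThreadsPerMultiProcessor));
    return 0;
}
```

Running the same binary on the K40 and the K80 would show whether the larger register file and shared memory actually translate into more resident blocks for your kernel.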

It seems you are imposing a limit on the register use of the kernel, either with __launch_bounds__ or with -maxrregcount: 255 registers per thread are available, yet only 128 are used and there is spilling. How does the performance change if you remove whatever limits the register count?
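To make the two mechanisms concrete, here is a minimal sketch. The kernels, the bounds (256, 2) and the helper approxRegCap are hypothetical and not taken from your code; the helper is only a back-of-the-envelope approximation of how the bounds translate into a register cap, not the compiler's exact rule.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bounded variant: promises at most 256 threads per block and requests at
// least 2 resident blocks per SM. The compiler may cap registers per thread
// to honor this, which can lead to spilling in a register-hungry kernel.
__global__ void __launch_bounds__(256, 2) scaleBounded(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Unbounded variant: same body, no per-kernel register limit.
__global__ void scaleUnbounded(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Approximation (assumption): registers per thread that still let
// minBlocksPerSM blocks of maxThreadsPerBlock threads fit, capped at 255.
static int approxRegCap(int regsPerSM, int maxThreadsPerBlock, int minBlocksPerSM)
{
    int cap = regsPerSM / (maxThreadsPerBlock * minBlocksPerSM);
    return cap < 255 ? cap : 255;
}

int main()
{
    cudaFuncAttributes bounded, unbounded;
    if (cudaFuncGetAttributes(&bounded, scaleBounded) != cudaSuccess ||
        cudaFuncGetAttributes(&unbounded, scaleUnbounded) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }
    // numRegs: registers per thread; localSizeBytes: local memory per thread,
    // which includes any spill space.
    printf("bounded:   %d regs, %d bytes local\n",
           bounded.numRegs, (int)bounded.localSizeBytes);
    printf("unbounded: %d regs, %d bytes local\n",
           unbounded.numRegs, (int)unbounded.localSizeBytes);
    printf("approx. cap for (256, 2) with 64K regs/SM: %d\n",
           approxRegCap(64 * 1024, 256, 2));
    return 0;
}
```

A trivial kernel like this uses few registers either way, so the two variants will likely report the same numbers; the difference shows up in kernels near the limit. Note that -maxrregcount=N on the nvcc command line applies to every kernel in the compilation unit, while __launch_bounds__ applies per kernel. Building with -Xptxas -v prints the same "Used N registers" and spill lines as quoted earlier in the thread.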

Certainly installing CUDA 7.0 seems like a good idea at this point.