Execution time differs between arch=sm_11 and sm_20? (w/ CMake)

Hi all,

I was wondering if anyone has insight into some behavior I have been seeing. I have a CUDA application, which I compile using CMake (and the convenient FindCUDA.cmake).

In my CMakeLists.txt file, I am using the following to set the architecture:
SET(CUDA_NVCC_FLAGS "-arch=sm_20")

With sm_20 (running on a Fermi Tesla C2050), it compiles with 47 registers per thread and a little under 1 KB of shared memory.

Now when I set it to sm_11, each thread uses 14 registers and a little over 1 KB of shared memory. I also get about a 1.3x speedup over the sm_20 build!

Any ideas? Thanks!

Same here. (Very) different register counts when compiling with sm_11 and sm_20, but I do not have a clue why this is the case. If anyone has ideas or suggestions, that would be great!

Best, Manuel

With current toolkits, different compilers are used for sm_11 and sm_20, which might explain the different findings (although other explanations are also possible; the architectures are different, after all). Try compiling both cases with either nvcc -nvvm (to force the new LLVM-based compiler) or nvcc -open64 (to use the old Open64-based one) to see whether the performance for the two architectures comes closer together.

Keep in mind, though, that in either case the resulting PTX code will get translated to an sm_20 binary at runtime, as this is the only code the C2050 is able to execute.
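
If you want to check what actually ends up running on the card, one option is to ask the runtime with cudaFuncGetAttributes. The snippet below is only a minimal sketch: the kernel scale is a hypothetical stand-in for your own kernel, and device 0 is assumed to be the C2050. It prints the register count and static shared memory of the loaded kernel, together with the PTX version it was compiled for and the binary version that was loaded. For an sm_11 build on a Fermi card I would expect the PTX version to read 1.1 and the binary version 2.0, and the register count reported here may differ from what nvcc printed for the sm_11 target.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel you are profiling.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    // Assumption: device 0 is the GPU under test.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Query the attributes of the kernel as it was loaded for this device.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Versions are encoded as major * 10 + minor, e.g. 20 for 2.0.
    printf("device:          %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("PTX version:     %d.%d\n", attr.ptxVersion / 10, attr.ptxVersion % 10);
    printf("binary version:  %d.%d\n", attr.binaryVersion / 10, attr.binaryVersion % 10);
    printf("registers/thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

Build it once with each -arch setting and compare the output of the two runs.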