Execution time different between arch=sm_11 and sm_20? (w/ CMake)
Hi all,

I was wondering if anyone had some insight into some behavior I have been seeing. I have a CUDA application that I compile using CMake (and the convenient FindCUDA.cmake).

In my CMakeLists.txt file, I am using the following to set the architecture:
SET(CUDA_NVCC_FLAGS "-arch=sm_20")

With sm_20 (running on a Fermi Tesla C2050), it compiles to 47 registers per thread and a little under 1 KB of shared memory.

When I set it to sm_11 instead, each thread uses 14 registers and a little over 1 KB of shared memory, and I get about a 1.3x speedup over sm_20!
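
For reference, here is roughly the relevant part of my CMakeLists.txt, with verbose ptxas output enabled so you can see where the register/shared memory numbers come from (the target and file names are just placeholders):

# Excerpt (sketch): -arch selects the target architecture, and
# --ptxas-options=-v makes ptxas print per-kernel register and
# shared memory usage at compile time.
FIND_PACKAGE(CUDA REQUIRED)
SET(CUDA_NVCC_FLAGS "-arch=sm_20" "--ptxas-options=-v")
CUDA_ADD_EXECUTABLE(myapp main.cu)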

Any ideas? Thanks!

#1
Posted 09/10/2010 12:51 AM   
Same here: (very) different register counts when compiling for sm_11 and sm_20, but I do not have a clue why this is the case. If anyone has ideas or suggestions, that would be great!

Best, Manuel

#3
Posted 04/25/2012 11:49 AM   
With current toolkits, different compilers are used for sm_11 and sm_20, which might explain the different findings (although other explanations are possible - the architectures are different, after all). Try compiling both cases with either nvcc -nvvm (to force the new LLVM-based compiler) or nvcc -open64 (to use the old Open64-based one) to see if performance for the two architectures comes closer together.
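
In terms of the CMake setup from the original post, that would look something like the following (just a sketch - the -nvvm/-open64 switches are toolkit-version-dependent, so check whether your nvcc accepts them first):

# Sketch: force a particular nvcc compiler backend while keeping the
# same target architecture. -open64/-nvvm availability depends on the
# toolkit version (see above).
SET(CUDA_NVCC_FLAGS "-arch=sm_11" "-open64")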

Keep in mind, though, that in either case the resulting PTX code will be translated to an sm_20 binary at runtime, as that is the only code the C2050 is able to execute.
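
If you want to skip that just-in-time step and ship native code for both architectures, you can build a fat binary instead (a sketch, using the standard -gencode syntax from the nvcc manual):

# Sketch: embed native sm_11 and sm_20 code in one fat binary, so the
# C2050 loads sm_20 machine code directly instead of JIT-compiling PTX.
SET(CUDA_NVCC_FLAGS
    "-gencode=arch=compute_11,code=sm_11"
    "-gencode=arch=compute_20,code=sm_20")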

#4
Posted 04/25/2012 01:17 PM   