Compute Capability 1.0 faster than 3.5?
I compiled my CUDA code with 'compute_10,sm_10' and with 'compute_35,sm_35'. The 1.0 version is 30% faster than the 3.5 version. Is there anything wrong? My card has compute capability 3.5.
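
For reference, the two builds would be produced roughly like this (a sketch; 'kernel.cu' stands in for the actual source file):

    nvcc -gencode arch=compute_10,code=sm_10 kernel.cu -o app_sm10
    nvcc -gencode arch=compute_35,code=sm_35 kernel.cu -o app_sm35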

#1
Posted 08/07/2013 10:16 PM   
Hard to give advice without the problematic code :)

#2
Posted 08/08/2013 02:04 AM   
I tend to compile to CC 1.0 or 1.1 as well, unless I specifically require language or device features from later CUDA versions. You also get the benefit of increased hardware compatibility (older devices and older drivers will run your code).

One reason for compute 1.x targets being faster might be that up to 124 registers per thread are available for compute 1.x targets, and the run-time (JIT) conversion from PTX 1.x code to the target hardware (which may be limited to 64 registers/thread) seems to do a really good job of minimizing register spills.
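
You can compare register pressure and spills between the two builds by asking ptxas for its statistics (a sketch; 'kernel.cu' is a placeholder):

    nvcc -gencode arch=compute_10,code=sm_10 -Xptxas -v -c kernel.cu
    nvcc -gencode arch=compute_35,code=sm_35 -Xptxas -v -c kernel.cu

The -v output lists registers used, spill loads/stores, and shared/local memory per kernel. For the 1.x build these are compile-time numbers; the JIT may reallocate registers for the actual hardware.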

Also, compiling for compute 1.x uses the Open64-based compiler, while newer targets use the LLVM-based compiler. Both may exhibit distinctly different performance characteristics due to different optimization strategies.

Christian

#3
Posted 08/08/2013 09:01 AM   
If the code uses double precision math, maybe it is being demoted to single precision.

#4
Posted 08/08/2013 05:00 PM   
The code only uses float and short; it does not use DP. The code does use quite a few registers. Thanks Christian for the explanation, but I still think the newer compiler should do at least as good a job as the older one.

#5
Posted 08/08/2013 05:36 PM   
[quote="ndzuser"]The code only uses float and short, does not use DP.[/quote] Gogar may still be right. Double and triple check that you have no doubles, especially in constants. It is very easy to accidentally write code like "a+=3.14159" or "a+=2.0*b". Even if a and b are both floats, those are still double precision computes because they use a double precision constant. They should be written like "a+=2.0f*b" to force single precision.
ndzuser said: The code only uses float and short, does not use DP.

Gogar may still be right. Double- and triple-check that you have no doubles, especially in constants. It is very easy to accidentally write code like "a+=3.14159" or "a+=2.0*b". Even if a and b are both floats, those are still double precision computations because they use a double precision constant. They should be written like "a+=2.0f*b" to force single precision.
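
A minimal illustration (a made-up kernel, not from the original code):

    __global__ void scale(float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // a[i] += 2.0 * b[i];  // 2.0 is a double constant: the multiply is promoted to double precision
            a[i] += 2.0f * b[i];    // 2.0f keeps the whole expression in single precision
        }
    }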

#6
Posted 08/08/2013 06:05 PM   
Double-checked, and it does not use double.

#7
Posted 08/08/2013 06:30 PM   
Although not always illuminating (since it isn't the final GPU machine code), you might want to see what the generated PTX looks like for both kernels.

Also, it's worth checking how many registers both builds use. Have you tried different block and grid configurations to find the best one for your kernel?
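
For example (a sketch; file names are placeholders):

    nvcc -arch=compute_10 -ptx kernel.cu -o kernel10.ptx    # PTX for the 1.0 build
    nvcc -arch=compute_35 -ptx kernel.cu -o kernel35.ptx    # PTX for the 3.5 build
    nvcc -arch=sm_35 -cubin kernel.cu
    cuobjdump -sass kernel.cubin                            # final machine code (SASS) for sm_35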

#8
Posted 08/08/2013 07:10 PM   
Compute capability 1.0 does not support full IEEE floating point and does not have an ABI.

If you compile for 3.5 with --use_fast_math and specify the option to disable ABI compliance, you should get comparable performance.
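
Something along these lines (a sketch; -abi=no is the ptxas switch for disabling ABI compliance, and 'kernel.cu' is a placeholder):

    nvcc -gencode arch=compute_35,code=sm_35 --use_fast_math -Xptxas -abi=no kernel.cu -o app_sm35_fast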

#9
Posted 08/09/2013 01:31 AM   