Compute Capability 1.0 faster than 3.5?
I compiled my CUDA code with 'compute_10,sm_10' and with 'compute_35,sm_35'. The 1.0 version is 30% faster than the 3.5 version. Is there anything wrong? My card has compute capability 3.5.
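
For reference, the two builds would be produced roughly like this (a sketch; 'kernel.cu' stands in for the actual source file):

    nvcc -gencode arch=compute_10,code=sm_10 kernel.cu -o app_sm10
    nvcc -gencode arch=compute_35,code=sm_35 kernel.cu -o app_sm35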

#1
Posted 08/07/2013 10:16 PM   
Hard to give advice without the problematic code :)

#2
Posted 08/08/2013 02:04 AM   
I tend to compile to CC 1.0 or 1.1 as well, unless I specifically require language or device features from later CUDA versions. You also get the benefit of increased hardware compatibility (older devices and older drivers will run your code).

One reason for compute 1.x targets being faster might be that up to 124 registers per thread are available for compute 1.x targets, and the run-time (JIT) conversion from PTX 1.x code to the target hardware (which may be limited to 64 registers/thread) seems to do a really good job of minimizing register spills.
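
You can compare register pressure and spills between the two builds by asking ptxas for its statistics (a sketch; 'kernel.cu' is a placeholder):

    nvcc -gencode arch=compute_10,code=sm_10 -Xptxas -v -c kernel.cu
    nvcc -gencode arch=compute_35,code=sm_35 -Xptxas -v -c kernel.cu

The -v output lists registers used, spill loads/stores, and shared/local memory per kernel. For the 1.x build these are compile-time numbers; the JIT may reallocate registers for the actual hardware.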

Also, compiling for compute 1.x uses the Open64-based compiler, while newer targets use the LLVM-based compiler. Both may exhibit distinctly different performance characteristics due to different optimization strategies.

Christian

#3
Posted 08/08/2013 09:01 AM   
If the code uses double precision math, maybe it is being demoted to single precision.

#4
Posted 08/08/2013 05:00 PM   
The code only uses float and short; it does not use DP. The code does use quite a few registers. Thanks Christian for the explanation, but I still think the newer compiler should do at least as good a job as the older one.

#5
Posted 08/08/2013 05:36 PM   
[quote="ndzuser"]The code only uses float and short, does not use DP.[/quote] Gogar may still be right. Double and triple check that you have no doubles, especially in constants. It is very easy to accidentally write code like "a+=3.14159" or "a+=2.0*b". Even if a and b are both floats, those are still double precision computes because they use a double precision constant. They should be written like "a+=2.0f*b" to force single precision.
ndzuser said: The code only uses float and short, does not use DP.

Gogar may still be right. Double- and triple-check that you have no doubles, especially in constants. It is very easy to accidentally write code like "a+=3.14159" or "a+=2.0*b". Even if a and b are both floats, those are still double precision computations because they use a double precision constant. They should be written like "a+=2.0f*b" to force single precision.
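
A minimal illustration (a made-up kernel, not from the original code):

    __global__ void scale(float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // a[i] += 2.0 * b[i];  // 2.0 is a double constant: the multiply is promoted to double precision
            a[i] += 2.0f * b[i];    // 2.0f keeps the whole expression in single precision
        }
    }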

#6
Posted 08/08/2013 06:05 PM   
Double-checked, and it does not use double.

#7
Posted 08/08/2013 06:30 PM   
Although not always illuminating (since it isn't the final GPU machine code), you might want to see what the generated PTX looks like for both kernels.

Also, it's worth checking how many registers both builds use. Have you tried different block and grid configurations to find the best one for your kernel?
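
For example (a sketch; file names are placeholders):

    nvcc -arch=compute_10 -ptx kernel.cu -o kernel10.ptx    # PTX for the 1.0 build
    nvcc -arch=compute_35 -ptx kernel.cu -o kernel35.ptx    # PTX for the 3.5 build
    nvcc -arch=sm_35 -cubin kernel.cu
    cuobjdump -sass kernel.cubin                            # final machine code (SASS) for sm_35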

#8
Posted 08/08/2013 07:10 PM   
Compute capability 1.0 does not support full IEEE floating point and does not have an ABI.

If you compile for 3.5 with --use_fast_math and specify the option to disable ABI compliance, you should get comparable performance.
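
Something along these lines (a sketch; -abi=no is the ptxas switch for disabling ABI compliance, and 'kernel.cu' is a placeholder):

    nvcc -gencode arch=compute_35,code=sm_35 --use_fast_math -Xptxas -abi=no kernel.cu -o app_sm35_fast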

#9
Posted 08/09/2013 01:31 AM   