CUDA 6.0 performance drop compared to 5.5

Hello,

I’ve found a performance issue in CUDA 6.0 integer arithmetic. I have a few kernels that calculate SHA-family hashes (SHA-1, SHA-256, etc.), and all of them run 50-80% slower on Fermi and Kepler than with the CUDA 5.5 release.
I’ve made a small example: [url]http://www.crark.net/download/cuda55_vs_6.zip[/url]. It calculates the SHA-256 of a very long string.
Instructions:

  1. Compile test.cu to PTX, then to a cubin, using the batch file iclb-cubin-v6.bat.
  2. Open the solution in VS 2012 and build the executable.
  3. Run test.exe and check the speed.
  4. If you don’t have both CUDA versions, precompiled files are in the PTX_CUBIN folder.

I noticed that CUDA 6.0 runs at about the same rate as CUDA 5.0, while CUDA 5.5 is significantly faster.
Please confirm my results and suggest a possible solution.
Thank you.
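
For readers without the archive, the kind of measurement described above can be approximated with a small stand-alone benchmark. The following is a minimal sketch, not the test.cu from the zip: `mix_kernel` and `time_mix_kernel` are hypothetical names, and the loop only imitates SHA-256-style rotates and additions rather than computing a real hash. It times one launch with CUDA events so that the same source can be built with each toolkit and the timings compared.

```cuda
#include <stddef.h>
#include <stdint.h>
#include <cuda_runtime.h>

// 32-bit rotate right for 0 < n < 32, the core primitive of SHA-256.
__host__ __device__ inline uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// SHA-256 "big sigma" functions as defined in FIPS 180-4.
__host__ __device__ inline uint32_t Sigma0(uint32_t x) {
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22);
}
__host__ __device__ inline uint32_t Sigma1(uint32_t x) {
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25);
}

// Hypothetical stand-in kernel: a tight loop of SHA-256-style integer ops.
// It is NOT a real hash and NOT the kernel from the archive.
__global__ void mix_kernel(uint32_t *out, int rounds) {
    uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t a = idx;
    uint32_t e = ~idx;
    for (int i = 0; i < rounds; ++i) {
        uint32_t t = Sigma1(e) + ((e & a) ^ (~e & (uint32_t)i)) + 0x428a2f98u;
        e = a + t;
        a = t + Sigma0(a);
    }
    out[idx] = a ^ e;  // store the result so the loop is not optimized away
}

// Returns the elapsed time of one timed launch in milliseconds,
// or a negative value if allocation or the launch failed.
float time_mix_kernel(int blocks, int threads, int rounds) {
    uint32_t *d_out = NULL;
    size_t bytes = sizeof(uint32_t) * (size_t)blocks * (size_t)threads;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    mix_kernel<<<blocks, threads>>>(d_out, 1);       // warm-up launch
    cudaEventRecord(start);
    mix_kernel<<<blocks, threads>>>(d_out, rounds);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}
```

Building the same file with each toolkit (for example `nvcc -O3 -arch=sm_30 bench.cu`, where the file name and target architecture are placeholders) and calling `time_mix_kernel(1024, 256, 10000)` several times gives timings that can be compared directly between versions.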

Given the magnitude of the performance differences, the first thing that comes to mind is that one build could be a debug build, while the other build is a release build. I am not aware of any significant code generation differences between CUDA 5.0, 5.5, and 6.0 for the class of codes mentioned. I would expect any performance differences to be in the single-digit percent range.

If, after careful review of the compilation settings, the performance differences persist, I would suggest filing a bug using the bug reporting form linked from the registered developer website.
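
One quick sanity check when comparing builds is to confirm which toolkit each executable was actually compiled against. The sketch below is an illustration rather than part of the original test: `decode_cuda_version` and `print_cuda_versions` are hypothetical helper names. It relies on the `CUDART_VERSION` macro and the runtime calls `cudaRuntimeGetVersion` and `cudaDriverGetVersion`, which encode a version as 1000 * major + 10 * minor (6000 for CUDA 6.0, 5050 for CUDA 5.5). It does not detect a debug build; device debug code comes from the nvcc `-G` switch, which has to be checked in the build settings.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Splits a CUDA version number (1000 * major + 10 * minor) into its parts.
void decode_cuda_version(int encoded, int *major, int *minor) {
    *major = encoded / 1000;
    *minor = (encoded % 1000) / 10;
}

// Prints the toolkit version seen at compile time and the runtime and
// driver versions seen at run time. Returns 0 on success, -1 on error.
int print_cuda_versions(void) {
    int runtime = 0, driver = 0, major = 0, minor = 0;

    // CUDART_VERSION comes from the headers the file was compiled with.
    decode_cuda_version(CUDART_VERSION, &major, &minor);
    printf("compiled against CUDA %d.%d headers\n", major, minor);

    if (cudaRuntimeGetVersion(&runtime) != cudaSuccess) return -1;
    decode_cuda_version(runtime, &major, &minor);
    printf("runtime library: CUDA %d.%d\n", major, minor);

    if (cudaDriverGetVersion(&driver) != cudaSuccess) return -1;
    decode_cuda_version(driver, &major, &minor);
    printf("driver supports up to CUDA %d.%d\n", major, minor);
    return 0;
}
```

Calling `print_cuda_versions()` at the start of each test executable makes it harder to mix up the 5.5 and 6.0 binaries when comparing timings.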

I’ve seen performance regressions of up to 30% in cudaMiner going from CUDA 5.5 to 6.0, so I never upgraded to 6.0.

I’ve heard through the grapevine that CUDA 6.5 might fix some of these problems. An early access preview is available now to selected developers.

Christian

Christian, do you remember what the cause of those performance regressions was? In which way did code generation change for the worse? Did you file a bug at the time? It would certainly be helpful if bugs were filed for performance regressions of that magnitude.

My memory may be failing me, but from the spot checks I have done on various secure hashes I do not recall any significant code generation issues in recent CUDA versions.

I previously isolated some simple code which did have a 40% slowdown when compiled in 6.0 vs 5.5. It’s likely a rare corner case since it was the worst behaving function out of several thousand I compared. The speeds between 5.5 and 6.0 are rarely identical. With several thousand functions, the speed differences form a distribution which looks like a normal curve with a mean deviation of about 5%, but obviously there are outliers like that 40% example.

I filed a report with a reproducible test case back in March. My suspicion is that it’s due to register allocation in ptxas, which sometimes causes inefficiencies in small tight loops. This is alluded to in this paper where they discuss the performance subtleties of register banking in SASS execution.
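
A first, coarse check on what ptxas did with a kernel is to ask the runtime how many registers and how much local memory (spill space) each thread uses. The sketch below is an illustration under stated assumptions, not the reproducer from the bug report: `tight_loop` is a hypothetical stand-in kernel and `report_kernel_resources` a hypothetical helper. It uses `cudaFuncGetAttributes`, which reports the values for the binary as built, so building the same source with 5.5 and 6.0 shows whether the allocation changed. Register bank effects of the kind mentioned above do not appear in these numbers; those require reading the SASS, for example with `cuobjdump -sass`.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for a small, tight integer loop.
__global__ void tight_loop(unsigned int *out, int rounds) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int x = idx;
    for (int i = 0; i < rounds; ++i) {
        // rotate right by 7, then mix in the loop counter
        x = ((x >> 7) | (x << 25)) + (x ^ (unsigned int)i);
    }
    out[idx] = x;
}

// Prints the per-thread register count and local memory use of tight_loop
// as compiled by the current toolchain.
// Returns the register count, or -1 if the query failed.
int report_kernel_resources(void) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, tight_loop);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("registers per thread:   %d\n", attr.numRegs);
    printf("local bytes per thread: %lu\n", (unsigned long)attr.localSizeBytes);
    return attr.numRegs;
}
```

The same figures are printed at build time when `-Xptxas -v` is passed to nvcc. A nonzero local memory size usually points to register spilling, which is worth ruling out before looking at instruction scheduling.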

I did write up some more analysis and timing histograms and sent them to Kevin Kang a few months back. Norbert, if you’re interested, I’ll forward it to you if you want to message me your email address.

Thanks for filing the bug. As you say, in every transition the performance difference for kernels follows a (possibly skewed) normal distribution, and outliers of both positive and negative kind will occur. While performance is tracked for various kernels, the potential “kernel universe” is basically infinite, so not all performance regressions can be found in internal testing. What you say about the susceptibility of small tight loops to code generation artifacts is also true, and if that tight loop dominates the overall kernel performance it will be very noticeable. Several of my codes have run into such issues in the past and one such bug is still open.

In this case I was just surprised to hear of performance regressions in codes for secure hashes and other crypto codes with recent tool chains as I have neither personally seen any nor heard or read of any to date. That does not mean such regressions did not occur, I may simply be unaware of them.

I am taking this opportunity to re-iterate that it is very helpful when CUDA programmers report any significant performance regressions through bug reports, as a well-written bug report with self-contained repro code attached is the fastest path to get such issues resolved. Thank you to all who file bug reports.

In practical terms, trying the latest tool chain (including possibly release candidates if that makes sense) as suggested by Christian is always a good idea.

I submitted the bug to the CUDA team, and today I got this reply:
“I have reproduced the performance drop issue on Win8_x64/GTX650Ti setup with CUDA6.0. However, this problem has been fixed in our development versions. The new version which contains this fix would be available in the next CUDA release.”