Tesla K40c: CUDA 6.5 / Visual Studio 2010 x64 vs. CUDA 3.2 / Visual Studio 2008 x64 performance

The project runs on Windows 7 x64. The build made with CUDA 3.2 and VS2008 x64 performs better than the build made with CUDA 6.5 and VS2010 x64. What could cause the performance drop in the new setup? I guess something needs to be done in the configuration of the new setup. The CUDA 3.2 build processes the same data in 60 ms, whereas the CUDA 6.5 build takes 300 ms. I would be grateful if someone could help me figure out the reason for the slowdown.

I assume you have already established that the performance difference is due to the GPU portion of the application, and not the CPU portion? Without knowing any specifics about the application and the system, it is difficult to offer specific advice. Does this application use mostly single-precision computation or mostly double-precision computation?

The huge difference in performance suggests that the slower configuration may be a debug build, while the faster configuration is a release build. Also check for any environment variables that could negatively impact performance such as CUDA_LAUNCH_BLOCKING=1 or CUDA_PROFILE=1 that may accidentally have been left set.
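To rule out the environment variables from inside the application itself, a minimal sketch like the following can be called at startup. It only uses the standard C++ library; the function names isEnvFlagSet and reportCudaDebugEnv are made up for this illustration, and treating an empty value or "0" as "not set" is an assumption of the sketch, not documented CUDA behavior.

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical helper: true when the variable exists and holds something
// other than an empty string or "0" (an assumption of this sketch).
bool isEnvFlagSet(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Hypothetical helper: warns about each of the two variables named in the
// post above that is currently set, and returns how many were set.
int reportCudaDebugEnv() {
    const char* names[] = {"CUDA_LAUNCH_BLOCKING", "CUDA_PROFILE"};
    int count = 0;
    for (int i = 0; i < 2; ++i) {
        if (isEnvFlagSet(names[i])) {
            std::cerr << "warning: " << names[i] << " is set\n";
            ++count;
        }
    }
    return count;
}
```

Calling reportCudaDebugEnv() as the first statement of main() in both builds shows whether either process actually sees one of these variables, which can differ from what a separate command prompt reports.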

It is not clear whether this is an identical system configuration with the only update being the tool chain, or whether these are two different systems. If they are two different systems, make sure there is a proper power supply for the K40c (one six-pin and one eight-pin connector, I think) and that there is no overheating. You can use nvidia-smi to check the GPU temperature. Make sure the K40c is in a x16 PCIe slot.

The difference is in the GPU portion of the application. It uses single precision, and both setups are on the same computer. I just installed CUDA 6.5 and VS2010 and compiled the same project with the corresponding configuration (library and include files for CUDA 6.5 and VS2010) as a debug x64 build. I will check whether CUDA_LAUNCH_BLOCKING=1 or CUDA_PROFILE=1 is set.
I am monitoring the GPU temperature continuously and have an extra external fan for cooling. The K40c is in a x16 PCIe slot and has no problem with the power supply.

Are both of these builds debug builds? If so, I do not see the point of measuring performance with debug builds; performance should only matter for release builds. It is entirely possible that improved debugging support in CUDA 6.5 causes the executables to incur significant additional overhead. What execution times do you observe when you switch to release builds?

Yes, they are both debug builds, and neither CUDA_LAUNCH_BLOCKING=1 nor CUDA_PROFILE=1 is set. The time with CUDA 3.2 is 60 ms and with CUDA 6.5 is 300 ms in debug builds. Do you think it is normal to have that much difference because of the improved debugging support? I will check the release build results as well. Thanks.

I have no notion what kind of performance is considered “normal” for debug builds. I never measure the performance of debug builds; it is a don’t care in my view, since the purpose of a debug build is to allow debugging, not to create a high-performance application.

It will be interesting to compare the execution times for the release builds. If you still see a large slowdown with CUDA 6.5 for those, you would want to file a bug report with NVIDIA (the bug reporting form is linked from the CUDA registered developer website).

With release builds, the VS2008 x64 CUDA 3.2 version runs in 60 ms, whereas the VS2010 x64 CUDA 6.5 version runs in 170 ms. I will file a bug report. Thanks.

Are you targeting the same compute capability in both builds? What compute capability are you targeting? You should carefully compare the nvcc compile command lines issued by VS in both cases, and test the differences, if any. Is the code executing on the K40c in both cases? Are you doing proper CUDA error checking, and have you run your code with cuda-memcheck in both cases to verify that there are no API errors reported in either case?
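Comparing two long nvcc command lines by eye is error-prone, so here is a rough sketch of a helper that lists the arguments present in one command line but not in the other. The names splitArgs and onlyInFirst are invented for this illustration, and it splits on whitespace only, so quoted paths containing spaces would be broken into pieces.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Splits a command line on whitespace. Quoted arguments that contain spaces
// are not handled; this is only a rough comparison aid.
std::vector<std::string> splitArgs(const std::string& commandLine) {
    std::istringstream in(commandLine);
    std::vector<std::string> args;
    std::string arg;
    while (in >> arg) {
        args.push_back(arg);
    }
    return args;
}

// Returns the arguments that appear in `first` but not in `second`,
// in the order they appear in `first`.
std::vector<std::string> onlyInFirst(const std::string& first,
                                     const std::string& second) {
    const std::vector<std::string> firstArgs = splitArgs(first);
    const std::vector<std::string> secondArgs = splitArgs(second);
    const std::set<std::string> seen(secondArgs.begin(), secondArgs.end());
    std::vector<std::string> result;
    for (std::size_t i = 0; i < firstArgs.size(); ++i) {
        if (seen.count(firstArgs[i]) == 0) {
            result.push_back(firstArgs[i]);
        }
    }
    return result;
}
```

As a purely hypothetical example (these are not the poster's actual command lines), comparing "nvcc -m64 -O2 -gencode=arch=compute_20,code=sm_20 -c kernel.cu" against "nvcc -m64 -G -gencode=arch=compute_35,code=sm_35 -c kernel.cu" would report -O2 and the compute_20 -gencode argument as unique to the first, and -G and the compute_35 -gencode argument as unique to the second. The real command lines can be copied from the Visual Studio build output of each project.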

After I filed the bug report, I compiled with CUDA 7.0 as NVIDIA suggested and got a 10% improvement compared to CUDA 3.2. However, in another algorithm that uses FFTs and IFFTs with FFT sizes 64 and 8192 via cufftPlanMany, the processing time doubled compared to the older version. The known issues for 7.0 state: “The static library version of cuFFT has several known issues that are manifested only when execution is on a Maxwell GPU (sm50 or higher) and when a transform contains sizes that factor to primes in the range of 67–127. The library may run more slowly and require more memory than the cuFFT 6.5 release.” Given this, is it normal for the processing time to double even when the FFT size is a power of 2? I am using compute_35, sm_35 as the compute capability in CUDA 7.0, since I have a Tesla K40c.

burcu, do you measure just the cufftExec* time, or the cufftPlan* time as well?
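One way to answer this is to time the two phases separately. The following is a minimal host-side sketch, assuming a C++11 compiler (std::chrono is not available in VS2010); elapsedMs is a name made up for this illustration, and the callables mentioned in the comment are hypothetical wrappers around the application's own plan-creation and execution calls.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// For GPU work, `work` must not return until the device has finished
// (for example by calling cudaDeviceSynchronize() as its last step);
// otherwise only the launch overhead is measured.
//
// Intended use (createPlans and runTransforms are hypothetical):
//   double planMs = elapsedMs(createPlans);    // plan creation only
//   double execMs = elapsedMs(runTransforms);  // execution only
template <typename Work>
double elapsedMs(Work work) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    work();
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

If the doubling shows up only in the plan-creation number, plans created once and reused across calls would keep it out of the per-frame processing time; if it shows up in the execution number, the slowdown is in the transforms themselves.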

@burcu

Would you please answer the above question in comment #10? Thanks.