Performance is much better when profiling with Nsight than when running production code

Hi,

I have encountered a very strange problem, and unfortunately a rather worrying one if our CUDA code is ever to make it into “production code”.

I have two almost identical machines. The only difference is that one has a single GPU (a GeForce Titan) and the other has two GPUs, one of them being a Titan Black.

On the single-GPU machine I can launch our CUDA-based app and it is accelerated very well compared to our fallback host code. On the other machine, unfortunately, things do not go as well:

When I launch the application by running the .exe (from Explorer, the command prompt, or the debugger), it runs about 8 times slower than expected. However, if I launch the application using a profiler (Nsight in Visual Studio, or nvvp), I get good performance (i.e. it runs approx. 8 times faster than when I am not profiling). This goes for both our own kernels and, e.g., cuFFT calls.

This behaviour is not related to two GPUs being present. I am using the Titan, and the same thing happens if I take out the second card.

If anybody has any clues as to how this could be, please let me know. Any help would be appreciated. What could the profilers be doing that I am not?

I am using CUDA 6.0, the runtime API, and Windows 8.1 64-bit. Today I reinstalled everything “Nvidia related” on the machine, without resolving the problem.

/ tsan

Nsight Visual Studio Edition supports attaching to a process to perform CUDA Debugging. If you have enabled this mode in the CUDA debugger then you may find that additional overhead has been introduced into the application. Make sure the following environment variables are not defined:

  • NSIGHT_CUDA_DEBUGGER=1
  • CUDA_INJECTION32_PATH
  • CUDA_INJECTION64_PATH

If the CUDA Debugger is set to be in pre-emption mode (desktop is enabled on the card) then setting NSIGHT_CUDA_DEBUGGER=1 will serialize kernel calls and add additional overhead to some CUDA API calls.

The Nsight Visual Studio CUDA Profiler, NVVP, and nvprof will override these settings.

It is recommended that you only set NSIGHT_CUDA_DEBUGGER when you want to debug. Please do not set it at a system level.
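
As a rough illustration of the check described above, an application could look for these variables at startup and warn if any are defined. This is only a sketch: the helper names (definedDebuggerEnvVars, warnIfDebuggerEnvSet) are made up for this example and are not part of CUDA or Nsight. Only the three variable names come from the list above, and std::getenv is standard C++.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Variable names taken from the list above.
static const char* const kDebuggerEnvVars[] = {
    "NSIGHT_CUDA_DEBUGGER",
    "CUDA_INJECTION32_PATH",
    "CUDA_INJECTION64_PATH",
};

// Hypothetical helper: returns the names of the variables above that are
// defined in the environment of the current process.
std::vector<std::string> definedDebuggerEnvVars() {
    std::vector<std::string> found;
    for (const char* name : kDebuggerEnvVars) {
        if (std::getenv(name) != nullptr) {
            found.emplace_back(name);
        }
    }
    return found;
}

// Hypothetical helper: prints one warning per defined variable to stderr
// and returns how many were found.
std::size_t warnIfDebuggerEnvSet() {
    const std::vector<std::string> found = definedDebuggerEnvVars();
    for (const std::string& name : found) {
        std::fprintf(stderr,
                     "warning: %s is defined; CUDA calls may be slowed down\n",
                     name.c_str());
    }
    return found.size();
}
```

Calling warnIfDebuggerEnvSet() once before the first CUDA call would be enough to flag a machine where the variables were left set at the system level. It only reports them; removing them is still done in the system or user environment settings.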

Greg, thank you very much for sharing this insight.

I indeed had those environment variables defined. Removing them immediately increased the performance of our compute kernels! Thanks again, your help is really appreciated.

/tsan