Upgrading Gfx Card and CUDA (to 8.0 RC) Causes Slowdown

I recently purchased a GTX 1080 and confirmed with the deviceQuery application that it is the primary card used by CUDA. I also swapped the CUDA 7.5 toolkit for 8.0 RC. My kernel, which had a running time of 45 ms on the 750 Ti, now takes 1100 ms to complete. No Visual Studio settings were changed other than switching the project's build customizations from CUDA 7.5 to 8.0 RC. The code is compiled in release mode, and I have checked thoroughly that the -G flag is not present.
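
For reference, the 45 ms / 1100 ms numbers were measured with CUDA events around the kernel launch. A minimal sketch of that kind of measurement (the kernel, size, and launch configuration here are placeholders, not the actual project code):

// Minimal kernel-timing sketch with CUDA events; myKernel and N are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in workload
}

int main()
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}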

What’s even more frustrating is that no combination of 7.5, 8.0, and the corresponding ‘compute_xx, sm_xx’ flags restores the original speed. By setting the compute device to

cudaSetDevice(1);

instead of

cudaSetDevice(0);

I should be using the old 750 Ti according to deviceQuery, yet even this change, combined with the options above, does not bring back the old timings. All recent drivers have been tested with the above.
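
One thing worth double-checking is which numeric index the 750 Ti actually received: the ordering is not tied to a particular card (by default the runtime enumerates the fastest device first, and the CUDA_DEVICE_ORDER environment variable can change it). A small stand-alone check, not taken from the project:

// List every CUDA device with its index so cudaSetDevice() targets the intended card.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}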

I have exhausted my working knowledge and am looking for help, however vague, with this peculiar project environment problem.

The huge difference in timings suggests that you are comparing a release build with a debug build, contrary to your assertion that no build settings changed.

The first step in addressing situations like this is to establish a baseline. Go back to the same driver version, tool chain, build settings, and GPU you started out with.

If that does not get you back to the original performance, there is little one can do to assist remotely. Maybe the performance measurement methodology used is flawed, maybe earlier performance data was recorded incorrectly, maybe a performance-relevant setting (such as an environment variable or compiler switch) was modified inadvertently. Maybe the ambient temperature is higher now, causing thermal throttling.

Once you have solid baseline numbers, you can perform controlled experiments in which exactly one variable is changed at a time: update the driver, then update CUDA, then switch the GPU. If a performance regression shows up along the way, you will have a good idea which general area to investigate and can do a deep dive on that.
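
As part of recording the baseline and each subsequent step, it can help to log the driver and runtime versions the application actually sees, for example with a small check like this (a generic sketch, not specific to any project):

// Log the CUDA driver and runtime versions seen by the application.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);
    cudaRuntimeGetVersion(&runtimeVer);
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 100) / 10,
           runtimeVer / 1000, (runtimeVer % 100) / 10);
    return 0;
}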

A methodical, step-wise approach is essential in resolving these kinds of issues.

For completeness’ sake, I want to mention that this slowdown was caused by the change in how managed memory is automatically transferred on the new card: the migration happens through page faulting while the kernel runs, so it was being registered as a kernel slowdown. Thanks njuffa for the motivation and help that led to this solution!
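
Concretely, with cudaMallocManaged on Pascal the pages migrate to the GPU on demand via page faults while the kernel executes, and that migration cost ends up inside the kernel timing. A rough sketch of the pattern and of one way around it on CUDA 8.0, prefetching the managed allocation before the launch with cudaMemPrefetchAsync (placeholder kernel and sizes, not the project code):

// Sketch: prefetch a managed allocation to the GPU before the launch so demand-paging
// (page-fault) migration on Pascal is not charged to the kernel's run time.
// scaleKernel and N are placeholders, not the actual project code.
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 24;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));
    for (int i = 0; i < N; ++i) data[i] = 1.0f;   // pages now resident on the host

    int dev = 0;
    cudaGetDevice(&dev);
    // Without this call, the kernel below faults the pages in one by one on Pascal
    // and the migration cost appears as kernel slowdown.
    cudaMemPrefetchAsync(data, N * sizeof(float), dev, 0);

    scaleKernel<<<(N + 255) / 256, 256>>>(data, N);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}

Using explicit cudaMemcpy transfers instead of managed memory similarly keeps the transfer cost out of the measured kernel time.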

related:

c++ - Why is NVIDIA Pascal GPUs slow on running CUDA Kernels when using cudaMallocManaged - Stack Overflow

There was not enough context in your original posting for me to make the connection previously.