I noticed (via the Visual Studio profiler) that cublasDdot, when called in host pointer mode, is extremely slow: it is about 20x slower than calling it in device pointer mode and then manually copying the result back to the host with cudaMemcpy.
I am OK with my solution of transferring the result manually, but I wanted to post here because I presume this is a bug that should be addressed in future versions.
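For anyone who wants to reproduce this, here is a minimal sketch of the two call patterns I am comparing. It is not my actual code: the vector length, the fill values, the variable names and the CHECK macros are placeholders made up for illustration, and the wall-clock timing is only a rough stand-in for what the profiler reports. Each path is called once before timing so that one-time initialization is not charged to either of them.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helpers: abort on any CUDA runtime or cuBLAS error.
#define CHECK_CUDA(call) do { if ((call) != cudaSuccess) { \
    fprintf(stderr, "CUDA error at line %d\n", __LINE__); exit(1); } } while (0)
#define CHECK_CUBLAS(call) do { if ((call) != CUBLAS_STATUS_SUCCESS) { \
    fprintf(stderr, "cuBLAS error at line %d\n", __LINE__); exit(1); } } while (0)

typedef std::chrono::high_resolution_clock Clock;

static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const int n = 1 << 20;  // assumed vector length, not from the original post
    std::vector<double> h_x(n, 1.0), h_y(n, 2.0);

    double *d_x = nullptr, *d_y = nullptr, *d_result = nullptr;
    CHECK_CUDA(cudaMalloc(&d_x, n * sizeof(double)));
    CHECK_CUDA(cudaMalloc(&d_y, n * sizeof(double)));
    CHECK_CUDA(cudaMalloc(&d_result, sizeof(double)));
    CHECK_CUDA(cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_y, h_y.data(), n * sizeof(double), cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    double host_result = 0.0, copied_result = 0.0;

    // Warm-up: run each path once so setup costs are excluded from the timings.
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST));
    CHECK_CUBLAS(cublasDdot(handle, n, d_x, 1, d_y, 1, &host_result));
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));
    CHECK_CUBLAS(cublasDdot(handle, n, d_x, 1, d_y, 1, d_result));
    CHECK_CUDA(cudaDeviceSynchronize());

    // Path A: host pointer mode. The result pointer is host memory, and the
    // call returns only once the result has been written there.
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST));
    Clock::time_point t0 = Clock::now();
    CHECK_CUBLAS(cublasDdot(handle, n, d_x, 1, d_y, 1, &host_result));
    double ms_host = ms_since(t0);

    // Path B: device pointer mode. The result pointer is device memory, and
    // the scalar is copied back explicitly with a blocking cudaMemcpy.
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));
    t0 = Clock::now();
    CHECK_CUBLAS(cublasDdot(handle, n, d_x, 1, d_y, 1, d_result));
    CHECK_CUDA(cudaMemcpy(&copied_result, d_result, sizeof(double), cudaMemcpyDeviceToHost));
    double ms_device = ms_since(t0);

    printf("host pointer mode:            dot = %f, %.3f ms\n", host_result, ms_host);
    printf("device pointer mode + memcpy: dot = %f, %.3f ms\n", copied_result, ms_device);

    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDA(cudaFree(d_x));
    CHECK_CUDA(cudaFree(d_y));
    CHECK_CUDA(cudaFree(d_result));
    return 0;
}
```

Both paths should print the same dot product; only the timings are of interest. A single timed call is noisy, so averaging over many iterations would give a fairer comparison.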
System properties:
Windows 8.1 OS
Visual Studio 2013
CUDA 7.5
Tesla K40 GPU