cublasDdot intolerably slow in host pointer mode on Windows with CUDA 7.5, Tesla K40

I noticed (via the Visual Studio profiler) that cublasDdot, when called in host pointer mode, is absurdly slow: it is about 20x slower than calling it in device pointer mode and then copying the result back to the host manually with cudaMemcpy.

I am OK with my solution of transferring the result manually, but I wanted to post here because I presume this is a bug that should be addressed in future versions.
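
For anyone who wants to reproduce this, here is a minimal sketch of the two call patterns being compared. It is not the original code: the vector length, the fill values, the warm-up call, and the wall-clock timing with std::chrono are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1 << 20;  // assumed vector length, not from the original post
    std::vector<double> h_x(n, 1.0), h_y(n, 2.0);

    double *d_x = nullptr, *d_y = nullptr, *d_result = nullptr;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMalloc(&d_result, sizeof(double));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so one-time cuBLAS initialization is not included in the timings.
    double warmup = 0.0;
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    cublasDdot(handle, n, d_x, 1, d_y, 1, &warmup);
    cudaDeviceSynchronize();

    // Pattern 1: host pointer mode. The result is written to host memory,
    // so the call returns only once the value is available.
    double host_result = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    cublasDdot(handle, n, d_x, 1, d_y, 1, &host_result);
    auto t1 = std::chrono::steady_clock::now();

    // Pattern 2: device pointer mode. The result is written to device memory
    // and then copied back explicitly with a blocking cudaMemcpy.
    double copied_result = 0.0;
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    auto t2 = std::chrono::steady_clock::now();
    cublasDdot(handle, n, d_x, 1, d_y, 1, d_result);
    cudaMemcpy(&copied_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);
    auto t3 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration<double, std::micro>(b - a).count();
    };
    std::printf("host pointer mode:            %.1f us (result %.1f)\n",
                us(t0, t1), host_result);
    std::printf("device pointer mode + memcpy: %.1f us (result %.1f)\n",
                us(t2, t3), copied_result);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_result);
    return 0;
}
```

Both patterns should produce the same dot product; only the timings are expected to differ. Build with nvcc and link against cuBLAS (for example, by adding cublas.lib to the linker inputs in Visual Studio).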

System properties:
Windows 8.1 OS
Visual Studio 2013
CUDA 7.5
Tesla K40 GPU

Bugs should be filed at developer.nvidia.com.