Extremely slow cudaMalloc call on GTX 1080 with CUDA 8 RC

Hello,

I recently bought a new GTX 1080 as a replacement for my GTX 980 Ti. I ran a few CUDA benchmarks and found that calling cudaMalloc on the GTX 1080 is almost 20x slower than on the GTX 980 Ti. Please see my code sample below.

Is this just a glitch in the release candidate of CUDA 8 that will be fixed in the final version?

Thanks a lot in advance
Cestmir

Environment:

OS: Windows 7 64 bit
NVIDIA driver: 368.81 WHQL
CUDA Toolkit: both CUDA 7.5 and CUDA 8 RC

Source code:

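#include <cstdio>
#include <cuda_runtime.h>
#include <helper_cuda.h>  // checkCudaErrors, from the CUDA samples

double fnElapsedTime();   // my timing helper, defined elsewhere (assumed to return seconds as a double)

int main(int argc, char **argv)
{
    float *f_A, *f_B;

    // warm up CUDA so that context creation is not included in the measurement
    checkCudaErrors(cudaMalloc((void **)&f_A, 100 * 1024 * 1024));  // dummy allocation, 100 MB

    fnElapsedTime();  // start the timer
    checkCudaErrors(cudaMalloc((void **)&f_B, 8053063680ULL));  // allocate 7.5 GB (7.5 * 1024^3 bytes)
    printf("cudaMalloc time: %.1lf sec.\n", fnElapsedTime());

    // clean up memory
    checkCudaErrors(cudaFree(f_A));
    checkCudaErrors(cudaFree(f_B));

    return 0;
}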

Output:

cudaMalloc time: 9.8 sec.
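
Note: the definition of fnElapsedTime is not included in the sample above. For anyone who wants to reproduce the measurement, a minimal sketch of such a helper could look like the following. It assumes the helper returns the wall-clock seconds elapsed since its previous call (and 0.0 on the first call), which matches how it is used in the sample; the actual helper behind the 9.8 sec. figure may differ.

```cpp
#include <chrono>

// Hypothetical stand-in for the fnElapsedTime() helper used in the sample.
// Assumption: it returns the seconds elapsed since the previous call,
// and 0.0 on the very first call (which only starts the clock).
double fnElapsedTime()
{
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by system clock changes
    static bool started = false;
    static Clock::time_point last;

    const Clock::time_point now = Clock::now();
    double seconds = 0.0;
    if (started) {
        seconds = std::chrono::duration<double>(now - last).count();
    }
    last = now;
    started = true;
    return seconds;
}
```

With this helper, the first call in the sample only starts the clock, and the second call returns the host-side time spent between the two calls, that is, the duration of the large cudaMalloc.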