Profiler returning nonsense memory statistics

There seems to be an issue with the CUDA 7 nvvp when showing memory bandwidth statistics from the TX1 (see the attached image, or the Imgur link).

It’s claiming an L2 cache rate of 14 PB/s and a unified cache rate of 99 PB/s - and I don’t believe TX1 is that fast :-). This is a trivial kernel that just copies one 65536-element array to another. Code is below. Using CUDA 7 from cuda-repo-ubuntu1404-7-0-local_7.0-71_amd64.deb.

#!/usr/bin/env python
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
import numpy as np

src = """
__global__ void stuff(const int * __restrict__ data, int * __restrict__ out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = data[idx];
}
"""

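# Compile the kernel and look up its entry point.
module = pycuda.compiler.SourceModule(src)
stuff = module.get_function('stuff')
# 256 blocks of 256 threads: one thread per element, N = 65536.
blocks = 256
blockdim = 256
N = blocks * blockdim
# Device buffers are left uninitialised; only the memory traffic matters here.
data = pycuda.gpuarray.GPUArray(N, np.int32)
out = pycuda.gpuarray.GPUArray(N, np.int32)
stuff(data, out, block=(blockdim, 1, 1), grid=(blocks, 1))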

EDIT: no idea how you’re supposed to put images into a post, so posted a link to imgur instead.
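
As a rough sanity check on the figures above, here is a small sketch of the arithmetic. It assumes the kernel reads each int32 once and writes it once, so the total traffic is 524288 bytes; the helper names are made up for illustration, and the profiler may well count L2 or unified cache traffic differently, so treat this as order-of-magnitude only.

```python
# Back-of-the-envelope check of the reported rates (pure arithmetic, no GPU needed).
# Assumption: one 4-byte read and one 4-byte write per element.
ITEMSIZE = 4        # bytes per int32 element
N = 256 * 256       # elements, matching blocks * blockdim in the script
PB = 1e15           # bytes in a petabyte (decimal)


def bytes_moved(n, itemsize=ITEMSIZE):
    """Bytes read plus bytes written by a one-to-one copy of n elements."""
    return 2 * n * itemsize


def implied_time_s(total_bytes, rate_bytes_per_s):
    """Time the transfer would have to take to achieve the given rate."""
    return total_bytes / float(rate_bytes_per_s)


def effective_bandwidth(total_bytes, elapsed_s):
    """Bandwidth in bytes/s implied by a measured kernel duration."""
    return total_bytes / float(elapsed_s)


total = bytes_moved(N)
print("bytes moved:", total)
print("time implied by 14 PB/s (s):", implied_time_s(total, 14 * PB))
print("time implied by 99 PB/s (s):", implied_time_s(total, 99 * PB))
```

At 14 PB/s the whole copy would have to finish in roughly 37 picoseconds, which is far shorter than a single GPU clock cycle, so the reported rates cannot be real. Plugging the kernel duration shown by the profiler into effective_bandwidth gives a number to compare against.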

Hi bmerry,
Would you please post this topic on the CUDA board - https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/ ?
You are more likely to get an accurate and helpful reply there.

Cheers.

Hi bmerry,
Thanks for reporting the issue. We are currently investigating it and will let you know when we have an update.

Cheers

Hi bmerry,
We have not been able to reproduce this so far.
Would you please provide more details, such as which image you used and the exact steps to reproduce the problem?

Cheers