Very slow to launch just 1 kernel

Hi

I’ve been trying to trace this all day… Now slightly frustrated. :)

Timing the launch of a single null kernel shows it takes over 1 ms on this machine. Scaled up to 10240 blocks x 128 threads, launching the same null kernel (for debugging it just returns immediately) takes 16 ms. That is 100x slower than Matlab doing the full operation. Surely this can’t be right?

From what I’ve been reading, the time to launch a kernel should be of the order of 20 us?

If I butcher one of the samples in the CUDA 7.0 installation and time a single kernel launch the same way, the result is the same.

I’ve only just started out with CUDA, so I don’t know what information I should provide, but here is what I can start with:

MSVC 2013 Professional C++
CUDA 7.0
K2000M GPU
i7 at 2.8 GHz
Code generation is set as “compute_20,sm_20;compute_30,sm_30”
The GPU is running at about 50 degC, and has 60% memory free.

This is the way I am timing the launch:

cudaDeviceSynchronize();
cudaEventRecord(start);
kSumSq <<< 1,1 >>>(d_mean_sq, d_in, n_est, estlen);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&times[1], start, stop);
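For comparison, here is a minimal self-contained sketch of the same event-based timing, with a warm-up launch and an average over many launches. The empty kernel `nullKernel` and the helper `averageLaunchMs` are hypothetical names that stand in for the real kernel and are not part of the original code. The sketch assumes the first launch in a process pays one-off costs (context creation, module loading) that should be kept out of the measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel standing in for kSumSq; it returns immediately.
__global__ void nullKernel() {}

// Average time in ms per launch over `iters` back-to-back launches
// (launch overhead plus execution of the empty kernel).
// Returns a negative value if any CUDA call fails.
float averageLaunchMs(int blocks, int threads, int iters)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Warm-up launch, excluded from the timing.
    nullKernel<<<blocks, threads>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        nullKernel<<<blocks, threads>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess) err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms / iters : -1.0f;
}

int main()
{
    printf("1 x 1:       %.4f ms per launch\n", averageLaunchMs(1, 1, 100));
    printf("10240 x 128: %.4f ms per launch\n", averageLaunchMs(10240, 128, 100));
    return 0;
}
```

If the averaged figure is far below the single-launch figure, the extra time is probably a one-off or per-submission cost rather than the kernel itself.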

This code has run much faster before (35 ms for 262144 x 128 kernels running my full kernel code), but I cannot see what is different now.

If anyone could give any advice on how to go about finding out why this is, I’d be grateful.

Kind regards, Kevin

Have you turned on persistence mode?

sudo nvidia-persistenced --user you --persistence-mode

edit: nvm, you’re on windows

[s]More info:

I am writing a mex-file to be used by Matlab. Profiling with Nsight as integrated in MSVS does not seem to detect any GPU activity (even though the kernel has “apparently” run, because the results are ok).

I tried using the Matlab Compiler to generate an *.exe, and tried the Visual Profiler, which also showed no data collected, and no kernels detected. (Trying this with one of the CUDA samples looks fine.)

Is it possible that the CPU is being used as a fallback?[/s]

I needed to add cudaDeviceReset() to flush the profiling data. Now I get profiling results.
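A minimal sketch of that fix in a standalone program, assuming the reset goes at the very end of the GPU work. `nullKernel` and `runAndFlush` are hypothetical names. In a mex-file the equivalent call would presumably go at the end of the mex entry point, but note that cudaDeviceReset() destroys the process's CUDA context, so any device memory still in use elsewhere in the process becomes invalid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, only here so the profiler has something to record.
__global__ void nullKernel() {}

// Launches the kernel, waits for it, then resets the device.
// Returns 0 on success and 1 if any CUDA call reports an error.
int runAndFlush()
{
    nullKernel<<<1, 1>>>();

    // Wait for the kernel so that launch and execution errors surface here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Tear down the context; profiling tools rely on this to flush
    // their buffered trace data before the process ends.
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "reset failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    return runAndFlush();
}
```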