Question #1
I have a function Run() that enqueues two kernels for execution:
// As you can see, I'm using events (eventRow, eventCol) for profiling.
How expensive, in terms of time, is a call to enqueueNDRangeKernel (or clEnqueueNDRangeKernel)?
With the NVIDIA OpenCL Profiler, I got a total execution time (on the GPU) of 351 ms, but when I measured the running time of the Run() method, I got 622 ms.
Why is this difference so large?
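To make the comparison concrete, here is a minimal sketch of the arithmetic behind those two numbers. It assumes the start and end timestamps of each event have already been read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END), which reports them in nanoseconds. The helper names deviceMs and hostOverheadMs are hypothetical and are not part of my Run() code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a pair of event profiling timestamps
// (nanoseconds, as reported by clGetEventProfilingInfo) into milliseconds.
double deviceMs(std::uint64_t startNs, std::uint64_t endNs) {
    assert(endNs >= startNs);  // the end timestamp must not precede the start
    return static_cast<double>(endNs - startNs) / 1e6;
}

// Hypothetical helper: the part of the host wall-clock time of Run()
// that is not covered by the two kernel executions themselves.
double hostOverheadMs(double wallMs, double rowKernelMs, double colKernelMs) {
    return wallMs - (rowKernelMs + colKernelMs);
}
```

With the figures above, hostOverheadMs(622.0, 351.0, 0.0) gives 271 ms that the profiler does not attribute to kernel execution (here the 351 ms total is passed as a single value, since I only have the combined figure).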
When is the data transferred to the GPU: when clEnqueueNDRangeKernel is called, or when the buffer is created (clCreateBuffer)?
I tested on an NVIDIA GT240.
I also tested on an ATI HD 5670, and there the difference is much smaller.