I’ve searched the Khronos and NVIDIA forums. There are several rumors about how concurrent kernel execution works in OpenCL, but most threads discuss running data transfer and kernel execution concurrently, or running multiple GPU devices concurrently. The concurrent kernel execution in the Fermi whitepaper means multiple kernels running on one device. Since my main algorithm launches many small kernels throughout, it would help immensely if the GPU did not have to wait serially for each small kernel.
I haven’t found it documented anywhere, but does OpenCL 1.1 (or the latest spec) include this feature yet?
My setup:
[codebox]============================================================
Platform 'NVIDIA CUDA'
============================================================
EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
NAME: NVIDIA CUDA
PROFILE: FULL_PROFILE
VENDOR: NVIDIA Corporation
VERSION: OpenCL 1.0 CUDA 3.1.1
Device 'GeForce GTX 470'
MAX_COMPUTE_UNITS: 14
…[/codebox]
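For reference, the listing above comes from querying platform and device info; in PyOpenCL the equivalent is roughly this (a minimal sketch, not my exact script):
[codebox]import pyopencl as cl

# Print the same platform/device attributes as the listing above.
for platform in cl.get_platforms():
    print("Platform '%s'" % platform.name)
    print("EXTENSIONS: %s" % platform.extensions)
    print("NAME: %s" % platform.name)
    print("PROFILE: %s" % platform.profile)
    print("VENDOR: %s" % platform.vendor)
    print("VERSION: %s" % platform.version)
    for device in platform.get_devices():
        print("Device '%s'" % device.name)
        print("MAX_COMPUTE_UNITS: %d" % device.max_compute_units)[/codebox]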
According to the whitepaper, the maximum is 16 concurrent kernels (one per SM?), so the GTX 470 with its 14 SMs should allow 14 concurrent kernels.
I’m testing with 2 kernels. Trying an OUT_OF_ORDER queue made no difference.
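For reference, here is roughly how I request it in PyOpenCL (a sketch; I also enable profiling so event.profile works later):
[codebox]import pyopencl as cl

ctx = cl.create_some_context()

# Hint that the queue may reorder commands, and enable profiling;
# the runtime is free to serialize kernels despite the hint.
props = (cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE
         | cl.command_queue_properties.PROFILING_ENABLE)
queue = cl.CommandQueue(ctx, properties=props)[/codebox]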
The kernel call (clEnqueueNDRangeKernel) and read_buffer are indeed non-blocking (blocking is an optional flag that defaults to non-blocking). I tested this by querying event1.profile.end at various times (an error is raised if event1 has not finished by then). I’ve observed that a kernel may finish before or after the next kernel starts, but for equal-sized kernels they always finish in order.
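The check looks roughly like this (a sketch; event1 is the event returned by the kernel enqueue, on a queue with profiling enabled):
[codebox]import pyopencl as cl

# Poll the event's execution status; the profile counters
# (nanoseconds) are only valid once the event is COMPLETE.
status = event1.get_info(cl.event_info.COMMAND_EXECUTION_STATUS)
if status != cl.command_execution_status.COMPLETE:
    raise RuntimeError("kernel has not finished yet")
elapsed_ns = event1.profile.end - event1.profile.start[/codebox]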
I don’t know how accurate event.profile is, but measuring real time with Python’s time(), 2 kernels take exactly 2x the time of 1 kernel (within about 1% variation). I tried various combinations: 2 kernels on 1 queue, on 2 queues, on 2 separate contexts with their own kernels, and with/without explicit waits. Making all CL functions wait is perhaps <0.5% slower than no waits, but usually the two are identical; I’m not sure why, given that none of the functions block. Perhaps the data is too small/fast.
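The 2-queue variant of the test looks roughly like this (a sketch: ctx and props are from the queue example above, and prg.dist is a placeholder name for the distance kernel shown below):
[codebox]import time
import numpy as np
import pyopencl as cl

q1 = cl.CommandQueue(ctx, properties=props)
q2 = cl.CommandQueue(ctx, properties=props)

pts = np.random.rand(100000, 4).astype(np.float32)  # 3D points padded to float4
mf = cl.mem_flags
in1 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=pts)
in2 = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=pts)
out1 = cl.Buffer(ctx, mf.WRITE_ONLY, pts.shape[0] * 4)
out2 = cl.Buffer(ctx, mf.WRITE_ONLY, pts.shape[0] * 4)

t0 = time.time()
e1 = prg.dist(q1, (pts.shape[0],), None, in1, out1)  # returns immediately
e2 = prg.dist(q2, (pts.shape[0],), None, in2, out2)  # returns immediately
e1.wait()
e2.wait()
print("wall time for 2 kernels: %.3f s" % (time.time() - t0))[/codebox]
If the kernels overlapped, the wall time would come out well under 2x the single-kernel time; it never does.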
The input is 100k 3D points, and the kernel has a for loop repeating the same L2-norm operation 20k times. That’s about 2e9 distance operations. The 20k repeats use no extra data space, but increasing the repeats or the number of points causes Fermi to black out and crash the driver (exceeded memory?). 1 kernel takes about 1.71 seconds.
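For concreteness, the kernel is of this shape (a sketch: the name dist, the float4 layout, and the exact arithmetic are placeholders, not my actual code):
[codebox]import pyopencl as cl

# ctx as in the sketches above. Each of the 100k work-items repeats
# an L2-norm 20,000 times; the (float)r term varies the operand so
# the compiler cannot hoist the loop body.
src = """
__kernel void dist(__global const float4 *pts, __global float *out)
{
    int i = get_global_id(0);
    float4 p = pts[i];
    float acc = 0.0f;
    for (int r = 0; r < 20000; r++)
        acc += sqrt(p.x * p.x + p.y * p.y + p.z * p.z + (float)r);
    out[i] = acc;
}
"""
prg = cl.Program(ctx, src).build()[/codebox]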
The other combinations behave similarly, adding more time overhead the higher up the hierarchy you go. I even hastily tried 2 separate Python interpreters to guarantee thread independence. 2 kernels should take 3.4 seconds; 2 Python processes running 4 kernels in total take 7 seconds, with the 2nd interpreter starting 1-2 seconds after the 1st.
Aside from some simplifications, PyOpenCL is a one-to-one wrapper around the same C functions, and supposedly never blocks until the wait() before printing results. From all this it’s clear that kernel execution, at least, is serial on one device. Perhaps this happens internally in the GPU/driver as a single queue, as a poster suggested in another topic.
Can an NVIDIA developer point to working sample code for parallel kernels on one device, or explain how to use this Fermi feature?