host-device latencies?

I have been running some benchmarks recently and wonder whether my host-device
latencies are due to my older hardware or would be similar on newer systems.

OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, 1 TFLOPS

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, empty kernel, no memory buffer transfer:

~35K calls per second

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, empty kernel, with an 8 KB memory write and a 4 KB memory read transfer:

~10K calls per second
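
To make the two rates easier to compare with latency figures, here is a small sketch that converts them into average time per call. It assumes the calls run strictly one after another (which the clFinish after each call should ensure); the function name per_call_us is just illustrative.

```python
# Convert measured call rates into average wall-clock time per call.
# Assumption: calls are serialized, so time per call = 1 / rate.

def per_call_us(calls_per_second: float) -> float:
    """Average time per call in microseconds."""
    return 1e6 / calls_per_second

empty_us = per_call_us(35_000)     # empty kernel, no buffer transfer
transfer_us = per_call_us(10_000)  # empty kernel, 8 KB write + 4 KB read

print(f"empty kernel:   {empty_us:.1f} us per call")
print(f"with transfers: {transfer_us:.1f} us per call")
print(f"added by transfers: {transfer_us - empty_us:.1f} us per call")
```

So the empty call costs roughly 29 microseconds, and the two small transfers add roughly 71 microseconds on top of that.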

Note that my machine is a bit outdated:

  • PCIe via Northbridge
  • PCIe 2.0
  • only 8 lanes per slot

Maybe on newer systems the latencies do not hurt at all?

Thanks in advance,
Srdja

I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimum observed launch time for a null kernel is 5 microseconds, meaning that there can be at most 200,000 kernel invocations per second. That minimum launch overhead has not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for a kernel execution time of more than 1 millisecond, so that the launch overhead becomes negligible.
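
As a rough illustration of why the 1 millisecond guideline helps, the sketch below computes the launch ceiling and the share of time lost to launch overhead for a few kernel durations. It takes the 5 microsecond figure from above as given and assumes launches are fully serialized with no overlap; the function names are made up for this example.

```python
# Back-of-the-envelope model of kernel launch overhead.
# Assumptions: fixed 5 us launch cost, launches do not overlap with execution.

LAUNCH_OVERHEAD_US = 5.0  # minimum null-kernel launch time quoted above

def max_launches_per_second(overhead_us: float = LAUNCH_OVERHEAD_US) -> float:
    """Upper bound on launch rate if each launch costs overhead_us."""
    return 1e6 / overhead_us

def overhead_fraction(kernel_us: float,
                      overhead_us: float = LAUNCH_OVERHEAD_US) -> float:
    """Share of total time spent on launch overhead rather than execution."""
    return overhead_us / (kernel_us + overhead_us)

print(f"launch ceiling: {max_launches_per_second():.0f} per second")
for kernel_us in (10.0, 100.0, 1000.0):
    share = overhead_fraction(kernel_us)
    print(f"kernel {kernel_us:6.0f} us -> {share:.1%} launch overhead")
```

Under these assumptions a 10 microsecond kernel loses about a third of its time to launching, while a 1 millisecond kernel loses about half a percent.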

PCIe version and link width primarily affect PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal.

Thanks, this is exactly what I was looking for.

I can change my design to device-based computation with about 1 second per run.


Srdja