Increased CPU usage with recent drivers, starting from 270.xx and continuing with 285.xx

I see increased CPU usage (~100% of one CPU core) for my OpenCL program, starting with driver version 270.xx.
It uses very little CPU with the 263.06 driver.

What can be done to work around this issue, or when can one expect a fix for it in the driver?

I’ve been seeing this for the past few releases too. I can work around it by manually polling the event to see if the kernel has finished and sleeping between polls, but I’d rather not have to.
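For anyone who wants to try the polling workaround described above, here is a minimal sketch in C. It is not the code from any app in this thread. The names wait_by_polling, poll_fn, poll_cl_event, poll_countdown and the HAVE_OPENCL macro are made up for illustration; the only real OpenCL call is clGetEventInfo with CL_EVENT_COMMAND_EXECUTION_STATUS. The idea is to enqueue the kernel, call clFlush on the queue so the work is actually submitted, and then check the event status in a loop with a sleep between checks, instead of blocking in clFinish or clWaitForEvents. It assumes a POSIX system for nanosleep; on Windows, Sleep() would take its place.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Result of one status check (hypothetical names, not an OpenCL type). */
enum poll_status { POLL_ERROR = -1, POLL_DONE = 0, POLL_PENDING = 1 };

/* Callback that checks once whether the work has finished. */
typedef enum poll_status (*poll_fn)(void *state);

/* Sleep for the given number of microseconds (POSIX nanosleep). */
void sleep_us(long us)
{
    struct timespec ts;
    ts.tv_sec = us / 1000000L;
    ts.tv_nsec = (us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);
}

/* Check, sleep, check again, until the callback reports completion.
 * Returns the number of checks made, or -1 if the callback reported
 * an error. The CPU is idle during each sleep. */
int wait_by_polling(poll_fn poll, void *state, long interval_us)
{
    int polls = 0;
    assert(poll != NULL);
    for (;;) {
        enum poll_status s = poll(state);
        ++polls;
        if (s == POLL_DONE)
            return polls;
        if (s == POLL_ERROR)
            return -1;
        sleep_us(interval_us);
    }
}

#ifdef HAVE_OPENCL
#include <CL/cl.h>
/* Adapter for a real OpenCL event; state points to a cl_event.
 * Call clFlush(queue) before polling so the kernel is submitted. */
enum poll_status poll_cl_event(void *state)
{
    cl_event ev = *(cl_event *)state;
    cl_int status;
    cl_int err = clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                                sizeof status, &status, NULL);
    if (err != CL_SUCCESS || status < 0)  /* negative status = failed command */
        return POLL_ERROR;
    return status == CL_COMPLETE ? POLL_DONE : POLL_PENDING;
}
#endif

/* Stand-in for a device, so the loop can be tried without a GPU:
 * reports "pending" for the first `remaining` checks, then "done". */
struct countdown { int remaining; };

enum poll_status poll_countdown(void *state)
{
    struct countdown *c = state;
    if (c->remaining > 0) {
        c->remaining--;
        return POLL_PENDING;
    }
    return POLL_DONE;
}
```

With a real event the call would look like wait_by_polling(poll_cl_event, &ev, 1000). The sleep interval is a guess that has to be tuned: too long and the GPU sits idle between kernels, too short and the CPU usage creeps back up.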

Any progress in getting the 100% CPU usage bug with 270.xx and later drivers fixed? I’m having to leave a core free to fully utilise Raistmer’s OpenCL app with the 290.53 drivers.
AMD fixed their 100% CPU usage bug after only 2 driver releases, with Cat 11.9; Nvidia have released ~15 drivers since this bug was introduced.
(I have not tried the 295.xx drivers yet; they have a nasty bug where the CUDA device disappears when the DVI-connected monitor goes to sleep.)

Claggy

It’s the same in the 295.73 drivers as well: the Nvidia app reports usage of a whole core, while the same app in its AMD variant reports usage of only 2-3%.

Jamie

Hi Claggy,

What app is this? Can we examine the source code?

As you know, 100% CPU can mean a number of things.

It’s Raistmer’s Nvidia OpenCL Astropulse app for use on the Setiathome project.

Claggy

App sources are available from Berkeley’s SVN repository (read access is granted to anonymous users too): https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
In short, there is no 100% CPU usage on older NV drivers, and no high CPU usage on current ATi drivers.
The bug was reported to NV more than 2 months ago and nothing has changed since then (the bug was accepted for work, but there are no results or bugfix so far).

Have a look in the CUDA_Toolkit_Reference_Manual at the section about cudaSetDeviceFlags. AFAIK OpenCL doesn’t have a matching function set, so scheduling is probably set to auto, and the heuristics have probably changed:

To quote:

cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor.

cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.

cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.

AMD probably always yields, while on modern processors it seems that CUDA (I’m guessing that OpenCL is the same) will almost always spin rather than yield, since a single process rarely has more contexts than the machine has logical processors.
You can try creating more contexts than cores and see if that changes the behavior (assuming that OpenCL works the same way and doesn’t just always spin or merge contexts).
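To make the quoted rule concrete, here is a small sketch in C of the cudaDeviceScheduleAuto heuristic as the manual words it. This is not NVIDIA's code, the names auto_wait_policy and min_contexts_for_yield are hypothetical, and whether the OpenCL runtime follows the same rule is only a guess.

```c
#include <assert.h>

/* How the runtime waits for the device (hypothetical names). */
enum wait_policy { WAIT_SPIN, WAIT_YIELD };

/* The rule as quoted from the manual: with C active contexts in the
 * process and P logical processors, yield if C > P, otherwise spin. */
enum wait_policy auto_wait_policy(int active_contexts, int logical_processors)
{
    assert(active_contexts >= 0 && logical_processors > 0);
    return active_contexts > logical_processors ? WAIT_YIELD : WAIT_SPIN;
}

/* Smallest number of contexts that would tip the rule over to yielding. */
int min_contexts_for_yield(int logical_processors)
{
    return logical_processors + 1;
}
```

So an app with one context on a quad-core machine would spin, and under this rule it would take five contexts in the same process to get yielding. In a CUDA app the heuristic can be bypassed by passing cudaDeviceScheduleYield to cudaSetDeviceFlags; the open question in this thread is that OpenCL has no equivalent call.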

Thanks for the idea.

Can the number of active contexts be increased by running many instances of the app? There is only 1 context used per app instance.

We will try to make the number of app instances bigger than the number of CPUs in the system and report the results…

Pity that NV did not expose control of the scheduler behavior to OpenCL apps. They could at least use an environment variable to tell the runtime which method to use.
AMD uses such variables for ISA/IL dumping control, for example, so it’s a known practice…

Not sure how NVIDIA will treat that, since the quoted heuristic counts contexts in the process. It’s probably better to try to create several OpenCL contexts in the same app and just leave them unused. If your device is not in exclusive mode, you can create as many contexts as you want on it; if you don’t submit any kernels to them, they will not take any computing resources.