100% CPU usage when running CUDA code

__global__ void do_something(float* p, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float x = 1.0f / (i + 1);
    // busy-work: iterate cosine n*n times so the kernel runs for several seconds
    for(int j = 0; j < n*n; ++j)
        x = cosf(x);
    if(i < n)
        p[i] = x;
}


int main()
{
    int num_threads = 1024;   // threads per block
    int n = 8 * num_threads;  // total elements (8 blocks)

    float* p;
    cudaMalloc(&p, n * sizeof(float));

    do_something<<<n/num_threads, num_threads>>>(p, n);

    cudaDeviceSynchronize();  // host waits here until the kernel finishes

    cudaFree(p);
}

When I run this, top and htop show 99-101% CPU usage:

$ time ./a.out
real    0m12.921s
user    0m10.819s
sys     0m1.996s

Is the CPU really busy while it waits for the kernel to finish, or is this a measurement artifact? If it really is busy, what is it doing (and does it matter)?

Yes, it’s actually busy in a polling loop inside the driver function associated with cudaDeviceSynchronize(), waiting for the GPU to finish. In a single-threaded scenario this probably doesn’t matter (what else were you going to have that core do?), but in a multi-threaded scenario you may prefer some other control-relinquishing scheme. In theory CUDA offers these via cudaSetDeviceFlags(); take a look at the documentation:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18074e885b4d89f5a0fe1beab589e0c8
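For reference, here is a minimal sketch of selecting one of those schemes (my own illustration, not code from the question): cudaDeviceScheduleBlockingSync asks the runtime to block the host thread on a synchronization primitive instead of spinning, and the flag has to be set before the CUDA context is initialized:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void k(){ }   // placeholder kernel

int main(){
  // Must run before anything that initializes the context
  // (first kernel launch, cudaMalloc, etc.).
  cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
  if (err != cudaSuccess)
    printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
  k<<<1,1>>>();
  cudaDeviceSynchronize();   // with the flag set, the host thread should block rather than spin
}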

In practice, I haven’t observed much difference between the various flag settings. YMMV. It may be somewhat platform-dependent, CUDA-version-dependent, and host-threading-model-dependent as well.

(Ultimately, it may not be easy to tell whether your host CPU core is busy spinning in a wait loop in the driver, or busy because it was released to do the work of some other thread you had queued up.)

Thanks – that was the part of the documentation I was looking for!

Unfortunately, I also see little difference after adding

cudaSetDeviceFlags(cudaDeviceScheduleYield);

at the beginning of main: htop still shows 100% usage, although time shows an improvement in user time:

$ time ./a.out

real    0m12.957s
user    0m7.216s
sys     0m5.642s

I have a similar issue: my CPU implementation uses less CPU than my GPU implementation. What approaches can I follow to reduce the CPU load while running my CUDA program?

My experience is that there is little difference in CPU utilization between the various scheduling choices when the CPU has nothing else to do. If you give the CPU lots of other thread work to do, then you should see a throughput difference in that other work depending on your scheduling choice.

In my experience, you can’t make the CPU activity level lower if the CPU has nothing else to do and it is spinning at a CUDA sync point. If you really want something like that, my suggestion would be: instead of doing a CUDA device or stream sync, put your GPU work into a stream, and then in a loop alternate cudaStreamQuery() with an OS call that puts the thread to sleep. You decide what level of responsiveness you want/need based on how long you put the CPU thread to sleep.

Here is an example. When I compile the following code and run it normally, I witness about 98% CPU utilization by the process in top:

# cat t72.cu
#include <unistd.h>

const long long delay = 8000000000LL;   // ~8e9 GPU clock ticks of busy-wait

// Spin on the device for roughly `delay` clock ticks.
__global__ void k(){
  long long start = clock64();
  while (clock64() < start+delay) {};
}


int main(){

  cudaStream_t s1;
  cudaStreamCreate(&s1);
  k<<<1,1,0,s1>>>();
#ifdef USE_SLEEP
  // Poll the stream, sleeping 5 us between queries, so the host thread
  // spends most of its wait asleep instead of spinning.
  bool done = false;
  while (!done){
    usleep(5);
    if (cudaStreamQuery(s1) == cudaSuccess) done = true;}
#endif
  cudaDeviceSynchronize();
}
# nvcc -o t72 t72.cu
# time ./t72

real    0m6.067s
user    0m3.927s
sys     0m2.068s

But when I add in the sleep loop, I witness about 30% CPU utilization in top. I think you could drive it lower by increasing the sleep duration from 5 microseconds to some higher value; this impacts the responsiveness to the completion of the kernel, of course:

# nvcc -o t72 t72.cu -DUSE_SLEEP
# time ./t72

real    0m6.076s
user    0m0.010s
sys     0m2.442s
#

I’m not showing the top output above (YMMV), but you can see there is also an effect on the results reported by Linux time.

In my test, if I make the kernel delay for a longer period of time and increase the sleep interval from 5 us to 100 us, I witness in top that the CPU utilization of the process eventually drops to about 5%.
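If you want fast response for short kernels but low utilization for long ones, one variation on the code above (my own sketch, not from the posts here) is to grow the sleep interval exponentially up to a cap:

#include <unistd.h>

const long long delay = 8000000000LL;

__global__ void k(){
  long long start = clock64();
  while (clock64() < start+delay) {};
}

int main(){
  cudaStream_t s1;
  cudaStreamCreate(&s1);
  k<<<1,1,0,s1>>>();
  // Poll with an exponentially growing sleep, capped at 1 ms:
  // responsive while the kernel is short, near-idle once it runs long.
  useconds_t interval = 5;                 // start at 5 us
  while (cudaStreamQuery(s1) == cudaErrorNotReady){
    usleep(interval);
    if (interval < 1000) interval *= 2;    // double toward the 1 ms cap
  }
  cudaDeviceSynchronize();
}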