Polling device variable while kernel is running

Hello everyone,

I have an application that runs on a GPU for hours or even days, depending on the input. Hence, I would like to have some indication of the percentage of calculations that have been completed at constant time intervals (e.g. every minute).

The application consists of a set of nested loops with known numbers of iterations. As a result, it is relatively easy to keep a count, in a global device variable, of how many iterations have already been calculated and to divide that number by the total number of iterations.

But to print the percentage of completion, I need some way on the host to poll that global variable at specific time intervals while the kernel is running. How can I do this continuously on the host? What I need to do on the host is described by the following pseudocode:

call kernel

do
  wait for time interval to pass
  poll device variable
  print "Percentage completed is ..."
while (kernel is running)

copy output data from device to host

My issue is which CUDA API function I have to use to read the value of the device variable without blocking execution on the host until the kernel finishes.

Any suggestions?

There are a number of concepts involved, including streams and zero-copy memory.

Here is a fully-worked example:

cuda - How can I check the progress of matrix multiplication? - Stack Overflow
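In outline, the zero-copy approach looks like this. This is a minimal sketch (placeholder kernel work, error checking omitted, block/thread counts arbitrary); the linked answer has the full version:

```cuda
#include <cstdio>
#include <unistd.h>

// Sketch: each block bumps a progress counter that lives in mapped
// (zero-copy) host memory; the host polls the same memory while the
// kernel runs. The kernel body is a placeholder.
__global__ void work(volatile unsigned int *progress)
{
    // ... real per-block work would go here ...
    if (threadIdx.x == 0) {
        atomicAdd((unsigned int *)progress, 1);  // one tick per block
        __threadfence_system();                  // make the write visible to the host
    }
}

int main()
{
    const int blocks = 1024, threads = 256;
    volatile unsigned int *progress_h;
    unsigned int *progress_d;

    cudaSetDeviceFlags(cudaDeviceMapHost);                  // enable mapped memory
    cudaHostAlloc((void **)&progress_h, sizeof(unsigned int),
                  cudaHostAllocMapped);                     // pinned + mapped
    cudaHostGetDevicePointer((void **)&progress_d,
                             (void *)progress_h, 0);        // device-side alias
    *progress_h = 0;

    work<<<blocks, threads>>>(progress_d);   // launch is asynchronous

    // Poll from the host while the kernel runs.
    while (*progress_h < (unsigned int)blocks) {
        usleep(100000);  // don't spin flat out
        printf("%u / %d blocks done\n", *progress_h, blocks);
    }
    cudaDeviceSynchronize();
    return 0;
}
```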

Somehow I missed that example. I probably did not use the appropriate search terms in Google :-)

I will have a look at it and come back if I have further questions.

Thanks!

I managed to create a working version, based on the above example. I used the same logic of increasing a counter after each block of threads finishes execution, as this proved to be sufficiently accurate for my purposes.

However, I am not completely happy with this solution, and I have a few more questions about how to improve it. Before I ask them, let me present my solution so that we have something more concrete to discuss.

My kernel looks like this:

__global__ void findSolutions(<Many problem-specific parameters>, volatile unsigned long long *progress)
{
  // Do work here

  // I use a 1D grid with 1D blocks of threads.
  // Only the thread with ID 0 in each block
  //  increases atomically the progress counter.
  if (threadIdx.x == 0) {
    atomicAdd((unsigned long long *)progress, 1);
    __threadfence_system();
  }
}

The code in the main() function is as follows:

volatile unsigned long long     *progress_h, *progress_d, currValue;
double                          currProgress, prevProgress = 0.0;
cudaError_t                     err;

err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
  printf("cudaSetDeviceFlags failed.\n");
  exit(1);
}

err = cudaHostAlloc((void **)&progress_h, sizeof(volatile unsigned long long), cudaHostAllocMapped);
if (err != cudaSuccess) {
  printf("cudaHostAlloc failed.\n");
  exit(1);
}
err = cudaHostGetDevicePointer((void **)&progress_d, (void *)progress_h, 0);
if (err != cudaSuccess) {
  printf("cudaHostGetDevicePointer failed.\n");
  exit(1);
}

*progress_h = 0;

findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);

do {
  currValue = *progress_h;
  currProgress = (double)currValue / (double)blocksPerGrid;
  // Print percentage only if there has been more than 1% progress.
  if (currProgress - prevProgress > 0.01) {
    printf("%5.1f%%\n", currProgress * 100.0);
    prevProgress = currProgress;
  }
} while (currProgress < 1.0);

Now my questions:

  1. How do I check, under the above scheme, that the kernel actually launched? I understand that cudaGetLastError() is normally called immediately after the kernel launch to check for this. But if cudaGetLastError() is blocking, execution will never reach the loop that shows progress.

Furthermore, I am not certain what the following note from the cudaGetLastError() documentation might mean in the context of my code and whether some error caused by another CUDA API call could be misinterpreted as an error of the kernel not being launched:

“Note that this function may also return error codes from previous, asynchronous launches.”

  2. I am not very happy with the fact that I check the variable currProgress to see whether execution of the kernel reached 100%. I would be much more comfortable using some CUDA-specific API call to check whether the kernel has actually finished execution. Reading through several posts here and on other forums, I think that a combination of cudaEventRecord() and cudaEventQuery() can provide the functionality I need. However, I have not been able to find a fully functional example of how to use them, i.e., I am not certain at which points in my code I need to call them. Does anyone know a link to a good example?

Furthermore, does the use of these two functions somehow interact with the use of cudaGetLastError() to check whether the kernel launched or not?

  3. The loop spins continuously, using 100% of a CPU core all the time. Would using some sleep function (sleep, usleep, nanosleep) cause any problem in the execution of the code (apart from some small latency in reacting to kernel completion)? Is usage of any of the sleep functions appropriate, or is there some other CUDA-specific way to do this?

Thank you in advance for your comments on the above.

  1. cudaGetLastError() is not blocking and can be used directly after a kernel launch to verify that the launch itself was successful. It does not verify that the kernel completed successfully unless you use a blocking call like cudaDeviceSynchronize() before the call to cudaGetLastError().

  2. You can use a method like this:

Launch CUDA-Kernel with a timeout - Stack Overflow

to verify that the kernel has completed (and optionally kill it).

  3. I’m sure you can use a CPU thread-based method to sleep instead of spinning on the progress variable, such as the usleep() function used in the example above.
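Putting the three points together, the host side looks roughly like this. A sketch only: the kernel is a trivial placeholder, the timeout value is arbitrary, and most error handling is trimmed:

```cuda
#include <cstdio>
#include <unistd.h>

// Trivial placeholder kernel; a real one would run for a long time.
__global__ void slowKernel() { /* work */ }

int main()
{
    slowKernel<<<64, 128>>>();

    // 1. Non-blocking check that the launch itself succeeded.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // 2. Record an event after the kernel and poll it for completion.
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done);   // signaled once all prior work finishes

    // 3. Sleep between polls; optionally give up after a deadline.
    int waited_ms = 0, timeout_ms = 10 * 60 * 1000;
    while (cudaEventQuery(done) == cudaErrorNotReady && waited_ms < timeout_ms) {
        usleep(1000000);     // 1 s; the CPU stays idle meanwhile
        waited_ms += 1000;
        /* read the mapped progress counter and print it here */
    }

    // A blocking sync now flushes any error the kernel hit while running.
    cudaDeviceSynchronize();
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Note that cudaEventQuery() returns cudaErrorNotReady while work before the event is still in flight, and cudaSuccess once it has completed.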

Dear txbob,

Thank you for your to-the-point answers and links. I managed to make it work as I wanted. Just as a reference, in case someone searches for something similar, I quote the new code I have created.

The kernel didn’t change:

__global__ void findSolutions(<Many problem-specific parameters>, volatile unsigned long long *progress)
{
  // Do work here

  // I use a 1D grid with 1D blocks of threads.
  // Only the thread with ID 0 in each block
  // increases atomically the progress counter.
  if (threadIdx.x == 0) {
    atomicAdd((unsigned long long *)progress, 1);
    __threadfence_system();
  }
}

The main() function now includes the following code:

volatile unsigned long long     *progress_h, *progress_d, currValue;
double                          currProgress, prevProgress = 0.0;
cudaError_t                     err;
cudaEvent_t                     kernelFinished;

err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
  printf("cudaSetDeviceFlags failed.\n");
  exit(1);
}

err = cudaEventCreate(&kernelFinished);
if (err != cudaSuccess) {
  printf("cudaEventCreate failed.\n");
  exit(1);
}

err = cudaHostAlloc((void **)&progress_h, sizeof(volatile unsigned long long), cudaHostAllocMapped);
if (err != cudaSuccess) {
  printf("cudaHostAlloc failed.\n");
  exit(1);
}

err = cudaHostGetDevicePointer((void **)&progress_d, (void *)progress_h, 0);
if (err != cudaSuccess) {
  printf("cudaHostGetDevicePointer failed.\n");
  exit(1);
}

*progress_h = 0;

/*
 * Call the kernel.
 */
findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);

/*
 * Check that the kernel actually launched.
 */
err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("CUDA error: %s\n", cudaGetErrorString(err));
  exit(1);
}

/*
 * Record when the kernel finishes execution.
 */
err = cudaEventRecord(kernelFinished);
if (err != cudaSuccess) {
  printf("cudaEventRecord failed.\n");
  exit(1);
}

/*
 * Print progress of calculations every second and only if more than 1% has been completed.
 */
printf("Progress of calculations:\n");
printf("  0.0%%\n");
fflush(NULL);
while (cudaEventQuery(kernelFinished) != cudaSuccess) {
  usleep(1000000);
  currValue = *progress_h;
  currProgress = (double)currValue / (double)blocksPerGrid;
  if (currProgress - prevProgress > 0.01) {
    printf("%5.1f%%\n", currProgress * 100.0);
    fflush(NULL);
    prevProgress = currProgress;
  }
}

cudaDeviceSynchronize();
err = cudaGetLastError();
if (err != cudaSuccess) {
  printf("CUDA error: %s\n", cudaGetErrorString(err));
  exit(1);
}

printf("Calculations finished\n");
fflush(NULL);

I typically redirect the output of the program to a file, hence I use fflush() to immediately update the file and see the progress.

I also artificially made the kernel fail to launch (by allowing it to use too many registers for the number of threads per block), and indeed the failure was correctly caught.

Thanks again for all your help.