I managed to create a working version, based on the above example. I used the same logic of increasing a counter after each block of threads finishes execution, as this proved to be sufficiently accurate for my purposes.
However, I am not completely happy with this solution, and I have a few more questions about how to improve it. Before I ask them, let me present my solution so that we have something concrete to discuss.
My kernel looks like this:
__global__ void findSolutions(<Many problem-specific parameters>, volatile unsigned long long *progress)
{
    // Do work here.
    // I use a 1D grid with 1D blocks of threads.
    // Only the thread with ID 0 in each block
    // atomically increments the progress counter.
    if (threadIdx.x == 0) {
        atomicAdd((unsigned long long *)progress, 1);
        __threadfence_system();
    }
}
The code in the main() function is as follows:
volatile unsigned long long *progress_h, *progress_d, currValue;
double currProgress, prevProgress = 0.0;
cudaError_t err;

err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
    printf("cudaSetDeviceFlags failed.\n");
    exit(1);
}

err = cudaHostAlloc((void **)&progress_h, sizeof(volatile unsigned long long), cudaHostAllocMapped);
if (err != cudaSuccess) {
    printf("cudaHostAlloc failed.\n");
    exit(1);
}

err = cudaHostGetDevicePointer((void **)&progress_d, (void *)progress_h, 0);
if (err != cudaSuccess) {
    printf("cudaHostGetDevicePointer failed.\n");
    exit(1);
}

*progress_h = 0;
findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);

do {
    currValue = *progress_h;
    currProgress = (double)currValue / (double)blocksPerGrid;
    // Print percentage only if there has been more than 1% progress.
    if (currProgress - prevProgress > 0.01) {
        printf("%5.1f%%\n", currProgress * 100.0);
        prevProgress = currProgress;
    }
} while (currProgress < 1.0);
Now my questions:
- How do I check, under the above scheme, that the kernel actually launched? I understand that cudaGetLastError() is normally called immediately after launching the kernel to check for this. But if cudaGetLastError() blocks, execution will never reach the loop that shows progress.
Furthermore, I am not certain what the following note from the cudaGetLastError() documentation might mean in the context of my code and whether some error caused by another CUDA API call could be misinterpreted as an error of the kernel not being launched:
“Note that this function may also return error codes from previous, asynchronous launches.”
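For concreteness, the naive placement I have in mind is something like this (untested; the error message text is just illustrative):

```
findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);
// Check whether the launch itself failed (invalid configuration, etc.).
err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
// ... then enter the polling loop as above ...
```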
- I am not very happy that I check the variable currProgress to decide whether execution of the kernel reached 100%. I would be much more comfortable using some CUDA-specific API to check whether the kernel has actually finished executing. Reading through several posts here and on other forums, I think a combination of cudaEventRecord() and cudaEventQuery() can provide the functionality I need. However, I have not been able to find a fully functional example of how to use them, i.e., I am not certain at which points in my code I need to call them. Does anyone know a link to a good example?
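From those posts, the pattern I have pieced together looks roughly like this (untested, and the event name is my own; please correct me if the placement is wrong):

```
cudaEvent_t kernelDone;
err = cudaEventCreate(&kernelDone);

findSolutions<<<blocksPerGrid, threadsPerBlock>>>(<Other parameters>, progress_d);
// Recorded on the same (default) stream as the kernel, so the event
// completes only after the kernel has finished.
err = cudaEventRecord(kernelDone, 0);

do {
    currValue = *progress_h;
    currProgress = (double)currValue / (double)blocksPerGrid;
    if (currProgress - prevProgress > 0.01) {
        printf("%5.1f%%\n", currProgress * 100.0);
        prevProgress = currProgress;
    }
    // cudaEventQuery() returns cudaErrorNotReady while work before the
    // event is still running, and cudaSuccess once it has completed.
} while (cudaEventQuery(kernelDone) == cudaErrorNotReady);

cudaEventDestroy(kernelDone);
```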
Furthermore, does the use of these two functions somehow interact with the use of cudaGetLastError() to check whether the kernel launched or not?
- The loop spins continuously, using 100% of a CPU core the whole time. Would using some sleep function (sleep, usleep, nanosleep) cause any problem in the execution of the code, apart from some small latency in reacting to kernel completion? Is any of these sleep functions appropriate here, or is there some other CUDA-specific way to do this?
Thank you in advance for your comments on the above.