Kernels time out or hang intermittently

I’m having an intermittent problem with kernels hanging indefinitely. My code
calls several kernels sequentially in a big loop with a few async memory copies
here and there.

for (long t = 0; t < 10000000; t++) {
    kernel_a( ..., stream[0]);
    kernel_b( ..., stream[0]);
    ...
    cuMemcpyDtoHAsync( ..., stream[0]);   /* async copy of results back to the host */
    kernel_g( ..., stream[1]);            /* independent work on a second stream */
    cuStreamSynchronize(stream[0]);       /* wait for the D->H copy to finish */
    some_cpu_work();
    cuMemcpyHtoDAsync( ..., stream[0]);
    cuCtxSynchronize();                   /* wait for everything before the next iteration */
}

The longest of these kernels takes around 10 ms to execute. The code will run for
hours (several hundred thousand kernel launches) but eventually hangs. I have the
floating-point precision defined in a macro so that I can change it as needed:

#define PREC double
PREC * x;
x = (PREC*)malloc(20*sizeof(PREC));

The problem only occurs when PREC is ‘double’. Other than these hangs, the code
runs as expected and gives sensible results. If I run the job on a GPU
with a display attached, I get error 702 when it hangs:

CUDA error 702: CUDA_ERROR_LAUNCH_TIMEOUT
This indicates that the device kernel took too long to execute. This can
only occur if timeouts are enabled - see the device attribute
::CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT for more information. The
context cannot be used (and must be destroyed similar to
::CUDA_ERROR_LAUNCH_FAILED). All existing device memory allocations from
this context are invalid and must be reconstructed if the program is to
continue using CUDA.

Otherwise, if timeouts are not enabled (i.e. if it is a compute-only GPU), I
get no error but the code still hangs. I’ve tested this with driver versions 304
and 310, with CUDA 5, on Debian and Arch Linux, and on a few different
machines, all with 3 GB GTX 580 GPUs. The memory usage is around 700 MB.

Can anyone suggest what sort of problem can cause a kernel to hang like this?
I am completely stuck on how to troubleshoot it.

Is there any chance someone can comment on this? I have no idea how it is even possible for a kernel that works fine 99.9999% of the time to just freeze for no apparent reason.

By printing a small message after each kernel or memcpy, you could test whether it’s always the same kernel that hangs.
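Something like this around each call (sketch only, reusing the placeholders kernel_a, t and stream[0] from your snippet):

fprintf(stderr, "iter %ld: before kernel_a\n", t); fflush(stderr);
kernel_a( ..., stream[0]);
CUresult err = cuStreamSynchronize(stream[0]);   /* serializes the work, but isolates the hang */
if (err != CUDA_SUCCESS) {
    fprintf(stderr, "iter %ld: kernel_a failed, error %d\n", t, (int)err);
    exit(1);
}

If it really hangs, the synchronize simply never returns and the last message printed tells you which call it was.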

Hi vanja,

I’m seeing similar behavior with two of my programs. They both use complex double-precision FFTs, interspersed with other kernels. The FFTs hang once in 1–100 million calls. It happens only on the card that is driving the display, and only with the 300+ drivers, on both Arch and Ubuntu 12.04. Just as you describe, if I disable the timer, the machine hangs indefinitely. The CUDA version does not seem to matter.

cdarby, thanks for your reply! It feels so good to find someone having the same problem!
You mention driver versions; is there a version that you can confirm works with your code? I could cross-check it against my machines/code. I am having a lot of difficulty troubleshooting because in my case the run time before failure is anywhere from 1 hour to 12 hours. Just now, I had a run fail that had been going for 20 hours. I know that at some stage in the past everything was working correctly, but it’s very hard to determine at what point that was…

I have a barrage of questions :)
Are you using the driver or runtime API?
What compute capability?
What GPU?
Are you using a single GPU or multiple?
Are you using any other libraries (MPI, OpenMP, pthreads, etc.)?

Could someone from NVIDIA comment on what sort of problem we are describing? Is there any specific “A can cause B” relationship that applies here?
E.g., accessing an out-of-bounds address can cause a segmentation fault.

I started seeing the problem after changing from driver 295.xx to 304.xx. CUFFT was so much faster with CUDA 5.0 than with 4.2 that I decided I could live with the occasional hang, so I never went back.

As to your other questions: I’m using the runtime API; the card that is giving the errors is a GTX 570, so compute capability 2.0; at times I’ve had other GPUs, a GTX 560 Ti and a Titan, as secondary cards; and the only other library I’m using is CUFFT.

I can reliably provoke the problem with an endless loop of FFTs interspersed with occasional device synchronizations. FFT lengths with prime divisors bigger than 7 fail much more quickly, often within minutes rather than hours or days.
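The repro is basically this shape (a sketch, not my exact code; N = 286 = 2·11·13 is just a length picked here for its prime factors > 7, and the sync frequency is arbitrary):

#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 286;                    /* 2 * 11 * 13: prime factors > 7 */
    cufftHandle plan;
    cufftDoubleComplex *d_data;

    cudaMalloc((void**)&d_data, N * sizeof(cufftDoubleComplex));
    cudaMemset(d_data, 0, N * sizeof(cufftDoubleComplex));
    cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);  /* complex double precision */

    for (long i = 0; ; i++) {
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
        if (i % 1000 == 0)
            cudaDeviceSynchronize();      /* occasional device sync */
    }
}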

Cool. And are you running calculations on more than one GPU?

I can also add that I think I have ruled out temperature as the culprit, since I have two varieties of GTX 580: several with a good cooler (Gigabyte) that sit at around 60 °C under load, and a few with a terrible cooler (Gainward) that sit at around 75 °C. The problem occurs on both models. I’ve previously had strange intermittent faults start to appear around 78 °C, so I have a 40 cm pedestal fan blowing at the Gainward cards in an effort to keep them below that mark :S

OK, I can confirm that this problem is not present when using the 295.41 driver. Is it possible to have someone from NVIDIA comment on this driver regression?

I think I might be having the same problem. I’ve never gotten any CUDA errors, but I run all the GPU machines headless, because ages ago I was having problems caused by trying to run X and CUDA analysis on the same GPU.

I am running single-precision 2D FFTs, for very long times (about half a million FFTs per day, for several months’ worth of data). I have ten machines running kernel 2.6.31 with nvidia driver 295.20, each with a single GTX 570, and these all run stably for very long times. Another machine is running kernel 3.1.10 with nvidia driver 295.20, with 4 GTX 570s, and this machine has been stable for very long times with analysis on each GPU simultaneously.

The problems started when I tried to build machines around the new GTX 770 cards. I could not get the 295.20 drivers to work with the new hardware, so I tried the 310, 319 and 325 drivers. With both the 2.6.31 kernel and much newer kernels (3.6, 3.7, 3.8), these drivers allowed me to run CUDA analysis on the 770s, but with some problems.

First of all, memtestG80 consistently finds errors on the GPU listed as 0. For example, if I put 4 GTX 770s in the machine, GPU0 always shows a few memtest errors, while GPU1-GPU3 show zero errors after thousands of iterations. This is not a fault with the GPU itself, because I can rearrange the GPUs in the PCI slots and the memtestG80 errors still show up on GPU0. This persists if I remove GPUs completely: with 2 GPUs, GPU0 shows errors while GPU1 does not; with 1 GPU, GPU0 (the only one) shows errors.

Worse, the same CUDA analysis code that I run on the other machines with driver 295.20 eventually hangs with the new drivers. I almost always get errors like “NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context”. For a while I can access the machine via ssh, but I cannot kill the CUDA process. During this time, I cannot run nvidia-smi or memtestG80. Eventually the machine hangs completely. These errors typically take 2-3 days to show up. The cards are not getting above 70 °C, and the problem occurs on cards that have not passed 50 °C, so this is not a thermal issue.

I am currently trying to run the analysis on all GPUs except GPU0, to see if this is a more stable configuration. These errors are terribly troubling, and I think they have been described in various guises on the Linux driver forum, although as far as I can see, the mods on that board have neither acknowledged any issues nor suggested any solutions. If I make any progress I will let you know - thanks for posting this information.
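In case anyone tries the same workaround, the device selection itself is trivial (runtime API sketch; gpu_index is a hypothetical command-line argument, not part of my actual setup):

/* Sketch: refuse to run on GPU0 and pin this process to one of the others. */
int gpu_index = atoi(argv[1]);        /* hypothetical: passed in by the job script */
if (gpu_index < 1) {
    fprintf(stderr, "refusing to run on GPU0\n");
    return 1;
}
cudaSetDevice(gpu_index);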

There seem to be some commonalities between the problems observed, such as very long application run times involving millions of kernel calls, and the use of CUFFT. It would be premature to conclude that these observations are due to a single root cause, however, let alone to pinpoint a root cause based on the descriptions.

I recommend filing a bug for each issue, using the bug reporting form linked from the registered developer website. Please attach a self-contained repro app, and provide as much system information and other information as possible (as some setups seem to be fairly complex and may be difficult to reproduce). It will be helpful for the repro app to be simplified as much as possible. Thanks.