"Display driver stopped responding and has recovered" WDDM Timeout Detection and Recovery

Have you seen this error message while running a CUDA program under Windows Vista or Windows 7?

[Screenshot: the Windows “Display driver stopped responding and has recovered” notification]

This message is telling you that the Timeout Detection and Recovery (TDR) feature of the Windows Vista driver model (WDDM) has been triggered because the CUDA kernel (or batch of kernels*) you were running took longer to complete than the configured timeout period allows. By default, this timeout period is two seconds in Windows Vista and Windows 7.

*This can even happen with really short-running kernels if there are a lot of such kernel launches queued up on each other’s heels, because in some instances the driver may batch up several kernel launches and submit them all at once, at which point WDDM requires all of the launches in the batch to run to completion within a single timeout period.

In Windows XP, there was a similar (though longer) timeout, which if exceeded would typically bugcheck (blue screen) the machine. The machine would typically appear frozen or hung until the timeout was reached. To make this failure mode more user-friendly, Microsoft reduced the timeout to 2 seconds starting in Windows Vista and introduced this driver recovery process. While useful for typical interactive graphics applications, this can be problematic for non-graphics (compute) kernels. This is especially true when you have a kernel that might run in far less than two seconds on a higher-end GPU but that takes longer on a lower-end GPU.

Microsoft has more information about the TDR mechanism and how to configure it on their website at http://www.microsoft.com/whdc/device/displ…dm_timeout.mspx . Note that if you change the registry keys described on that page (e.g., to increase the timeout period or to disable the timeout mechanism entirely), you MUST REBOOT before the registry setting changes take effect.
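For reference, and as a sketch only (the right values depend on your needs, and these commands are an illustration rather than a recommendation): the TDR values live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, and raising the timeout to ten seconds from an elevated command prompt might look like this:

:: Sketch only: TdrDelay is the timeout in seconds (default 2); TdrLevel 3 is
:: the default detect-and-recover behavior, while TdrLevel 0 disables timeout
:: detection entirely (use with caution). REBOOT for the changes to apply.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 3 /f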

If changing registry keys is not an option for you, you will need to split your kernels into pieces that you can be certain will run within two seconds even on the lowest-end GPUs you expect your application to run on. You could make this per-GPU, for example by scaling the amount of work per launch based on the number of SMs in the GPU, as in the sketch below.
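To make that concrete, here is a minimal sketch of the splitting approach. The process_chunk kernel, the chunk-size heuristic, and all constants are illustrative assumptions, not taken from the toolkit or this thread:

#include <cuda_runtime.h>

__global__ void process_chunk(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] *= 2.0f;   // stand-in for the real per-element work
}

int main(void)
{
    const int n = 1 << 22;
    float *dev_data;
    cudaMalloc((void **)&dev_data, n * sizeof(float));

    // Scale the work per launch by the SM count, so low-end GPUs get
    // proportionally smaller batches and stay under the timeout.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int chunk = 65536 * prop.multiProcessorCount;   // illustrative heuristic

    for (int offset = 0; offset < n; offset += chunk) {
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        process_chunk<<<(count + 255) / 256, 256>>>(dev_data, offset, count);
        // Wait for this chunk to finish so the driver cannot batch several
        // launches into a single timeout window (see the note above).
        cudaDeviceSynchronize();
    }

    cudaFree(dev_data);
    return 0;
}

Note that each individual launch still has to complete within the timeout on its own; the synchronization only prevents several launches from being counted against a single timeout period.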

[This issue is also mentioned in the Known Issues section of the CUDA Toolkit’s release notes.]

I would like to know if this is also a problem, or how it is handled, in Windows Server 2008.

Is there any effective difference between bypassing the Timeout Detection and Recovery using the registry hack and running under Linux, which apparently does not check the timeout?

This problem happened to me, and it automatically restarts my PC. After that it wouldn’t let me get to the desktop; whenever I started it, it automatically restarted. I then uninstalled the video drivers, and now it lets me work as before, but the screen looks greenish and brown lines appear on it.

I am using Windows 7 Professional and I have not installed any games on my system.

So what can I do to get the desktop graphics back to how they were before?

Thanks for your answers.

I’ve faced the problem in several games and applications, for example games using the PhysX engine when PhysX is processed on the GPU. Alice: Madness Returns, Mafia II, Mirror’s Edge, etc. use the PhysX engine. On Windows 7 64-bit the game triggers Timeout Detection and Recovery; after the TDR it resumes within one to two seconds and then runs without PhysX support. On Windows 7 32-bit the game simply hangs with no recovery, and the only option is to kill it from Task Manager.

Applications like iray also use CUDA power to render 3ds Max scenes. On low-end GPUs a render will run for 30–60 seconds at most; during that time GPU utilization sits at 99–100%, and after a certain period the GPU becomes non-responsive. Windows then TDRs the GPU (Windows 7 64-bit), or CUDA stops computing (Windows 7 32-bit). Meanwhile the host application keeps waiting for results from the CUDA threads and goes into a frozen state, and the only way out is to kill the program from Task Manager. This causes major data loss in the computation; for games, the auto-save points are never reached.

CAUSE OF PROBLEM: Massive amounts of CUDA computation are launched without managing the threads, so from the OS’s point of view the threads act like infinite loops. In real scenarios, though, it is common for simulation engines and robust systems to need more than two seconds of calculation. If I do the same calculation on the CPU, it runs slowly but can run for hours or days. Modern OSes have deadlock handling and process priority management, but the CUDA drivers and APIs don’t have that kind of feature yet, and the result is program crashes and data loss.

For example, if one NVIDIA GPU has 96 CUDA cores and performs a complex calculation in five seconds, a high-performance NVIDIA GPU will complete the same calculation in far less time.

This is not programming fault, it is a kernel and driver level fault. nvidia releases common driver for same generation GPU and if it will tune the performance for mid range than it will work fine for mid to high range and fail for below mid range.

I think there should be some mechanism in the driver or kernel to handle deadlocks and long-running computational threads on the CUDA cores:

  1. There should be a reserved amount of processing power that signals the OS that the GPU is still working and responsive.

  2. Thread priority management for the CUDA cores that does not interfere with other applications.

  3. The ability to ignore OS-level TDR and recover on its own, to protect against data loss.

  4. CUDA programmers should calculate data in chunks rather than the full set at once; this frees the GPU periodically so it can respond.

I’ve checked these scenarios with my GeForce GT 540M (1 GB) laptop GPU and GeForce 9500 GT (512 MB) desktop GPU. I’ve also tried writing CUDA programs with a large (but not infinite) number of loop iterations, but they always hit the limit.

I’ve just shared my experience here and hope NVIDIA engineers will look into this situation.

Thanks,

CUDA error in iray with 3ds Max 2012 (screenshot and system log). Target machine: GeForce GT 540M (1 GB), Intel Core i3 2.53 GHz, 4 GB DDR3, Windows 7 32-bit.

NVIDIA System Monitor Event Log

Time                     | Memory Usage | GPU FB Usage | GPU Usage | GPU Temp | CPU2 Tj Temp | CPU1 Tj Temp | CPU1 Usage | CPU2 Usage
Sun Aug 07 02:27:57 2011 | 77 %         | 51 %         | 90 %      | 61 °C    | 77 °C        | 74 °C        | 78 %       | 80 %
Sun Aug 07 02:28:07 2011 | 77 %         | 58 %         | 99 %      | 64 °C    | 81 °C        | 78 °C        | 90 %       | 66 %
Sun Aug 07 02:28:17 2011 | 77 %         | 58 %         | 99 %      | 66 °C    | 81 °C        | 79 °C        | 84 %       | 76 %
Sun Aug 07 02:28:27 2011 | 77 %         | 57 %         | 99 %      | 67 °C    | 84 °C        | 81 °C        | 93 %       | 73 %
Sun Aug 07 02:28:37 2011 | 77 %         | 56 %         | 98 %      | 68 °C    | 85 °C        | 79 °C        | 90 %       | 73 %
Sun Aug 07 02:28:47 2011 | 76 %         | 56 %         | 97 %      | 70 °C    | 87 °C        | 83 °C        | 80 %       | 77 %
Sun Aug 07 02:28:57 2011 | 76 %         | 58 %         | 99 %      | 70 °C    | 87 °C        | 83 °C        | 87 %       | 64 %
Sun Aug 07 02:29:07 2011 | 76 %         | 57 %         | 99 %      | 71 °C    | 87 °C        | 81 °C        | 54 %       | 63 %
Sun Aug 07 02:29:17 2011 | 76 %         | 0 %          | 0 %       | 66 °C    | 80 °C        | 80 °C        | 39 %       | 48 %

Hello Jamil,

I’m afraid it’s not as simple as somehow tuning our drivers better for low-end GPUs (and thereby getting all of these kernels to run inside of the timeout period), nor is it as simple as “ignoring” TDRs. Neither of these is really possible. A kernel takes however long it takes to complete on a particular GPU. If on some GPU that time happens to be longer than the timeout period, and if the WDDM driver mode is in use (i.e., you’re on Windows Vista, 7, 2008, etc., and you’re doing your computation on a GPU that is a (potential) display device), then the operating system will trigger a TDR, and the driver is powerless to stop it.

There is also no amount of “processing power” in use that could signal the operating system that the GPU and/or driver has not hung; this is undecidable (http://en.wikipedia.org/wiki/Halting_problem), hence an arbitrary timeout is the only solution. If a programmer cannot know for certain that a given task has hung, his/her only choice is to set up a timeout to have the computer wait as long as (s)he thinks it should take (and maybe just a little longer), and if it’s not done by then, assume it has hung. Two seconds as the default timeout here happens to work well for interactive graphics, since no frame should ever take as long as two seconds to complete; if it isn’t complete by then, the operating system assumes it has hung.

However, as discussed in this thread, two seconds might not be a sufficient timeout period in which all non-graphics computation might be assumed to complete. That is why the registry keys described on the Microsoft webpage I linked to at the beginning of this thread exist: to allow the user to configure the timeout period to suit his/her needs.

Hope this helps,
Cliff

Hi Cliff Woolley,

Thanks for your explanation. I know the situation is not easy to overcome, and until then the technology feels unstable: we can’t use many high-performance applications or play games using CUDA without risking data loss. I am even worried about using CUDA in my own applications.

NVIDIA GPUs have enough power to complete graphics rendering in time, but the situation with CUDA is driving me away from using CUDA technology. I’ve tried a lot, but nothing works for me with CUDA-powered applications. Could you please give me any specific suggestion?

You can see in my previous post that I couldn’t render a scene, and after that I had to close 3ds Max. Every time I close it, I have to sacrifice some of my work.

At least NVIDIA could provide some settings where we can set CUDA process priority, or control process priority like the Windows Task Manager does.

It is common to multitask on one device, and the device driver must be stable enough to handle that.

Thanks,

It doesn’t work that way. Once an application begins a batch of work on the GPU under WDDM, the batch must be completed before any other work from other applications (or the OS) can take place on the GPU. There is no preemption. Hence there can be no notion of a “priority”.

Graphics applications (games, etc.) are able to work well across a range of GPUs by way of tunable levels of detail. If you’re on a low-end GPU, you simply render at lower resolution and/or with lower levels of antialiasing, anisotropic texture filtering, etc. This reduces the amount of work the GPU has to do for any particular frame such that the time to completion of the frame goes down enough to bring the frames/second up to a reasonable level. If you tried to render with the same level of detail that you use on a high-end GPU on your low-end GPU, you would experience unacceptably low frame rates. This is basically another instance of the same thing.

There are two conceivable solutions: (1) the application/library/etc. could be written to have selectable levels of GPU Computing “detail” (to the extent that even makes sense for the particular application), though of course this requires direct support by the app; there’s nothing the driver can do on its own to accomplish this. (2) You, the user, can increase the amount of time allotted to complete the task.

I note again that this is not a CUDA-specific idiosyncrasy; it is an intentional design limitation of the Windows Display Driver Model (WDDM). WDDM’s timeout wasn’t designed with GPU Computing in mind, since that is not the most common use case for most consumers. If you want to do GPU Computing on Windows Vista+ using the WDDM driver and find the timeout too short, change the timeout; it’s that simple.

Hope this helps,

Cliff
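To illustrate solution (1) in code, here is a sketch under stated assumptions: a hypothetical step_kernel whose per-launch cost is controlled by an iters parameter, a one-off timing probe using CUDA events, and an arbitrary 100 ms per-launch target. None of these names or numbers come from NVIDIA documentation; they only show the shape of a self-tuning “compute detail” knob:

#include <cuda_runtime.h>

__global__ void step_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k)       // the tunable "detail" knob
        v = v * 1.0000001f + 0.5f;
    data[i] = v;
}

int main(void)
{
    const int n = 1 << 20;
    const int total_iters = 1000000;      // total work to get through
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    // Probe: time a small launch, then size every later launch to target
    // roughly 100 ms, far below the 2-second WDDM default.
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    const int probe_iters = 1000;
    cudaEventRecord(t0);
    step_kernel<<<(n + 255) / 256, 256>>>(d, n, probe_iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    if (ms < 0.01f) ms = 0.01f;           // guard against a degenerate probe
    int iters_per_launch = (int)(probe_iters * (100.0f / ms));
    if (iters_per_launch < 1) iters_per_launch = 1;

    // Many short launches instead of one long one; the synchronization
    // keeps each launch in its own timeout window.
    for (int done = 0; done < total_iters; done += iters_per_launch) {
        step_kernel<<<(n + 255) / 256, 256>>>(d, n, iters_per_launch);
        cudaDeviceSynchronize();
    }

    cudaFree(d);
    return 0;
}

On a faster GPU the probe finishes sooner and iters_per_launch grows; on a slower GPU the launches automatically get smaller, which is the compute analogue of rendering at a lower level of detail.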

Is there any documentation that elaborates on when the driver batches kernel launches, as described in the asterisked note above? I have just switched to developing on a machine with a single GPU, and the naive ways I have attempted to avoid the WDDM timeout have failed for reasons I cannot understand. For example, this code attempts to put a one-second window between sequential executions by splitting the data along the dev_row_list axis:

int iteration_size = 10000;

for (i = 0; i < 1 + row_list_length / iteration_size; i++)
{
    split_gpu<<<1 + cs_length / 256, 256>>>(dev_data, dev_row_list + i * iteration_size,
                                            dev_cs, row_length, iteration_size,
                                            cs_length, dev_count);
    Sleep(1000);
}

No Sleep(x) interval or iteration_size I have attempted allows this to run consistently; rather, the full execution on 100k items sometimes takes less than 2.5 seconds, and in these cases, even iteration_size=100k will not trigger the WDDM timeout. Am I splitting my executions in the wrong way? Short of shuttling my data across the PCI-e bus for each iteration and caching intermediate results in system RAM, I cannot think of another way to obtain more independent kernel launches. Any advice or pointers to documentation would be much appreciated.

I would add that, in my case, another behaviour I would not expect is that the timeout always occurs not close to the five-second mark, but rather after all of the iterations have completed and before I copy the device memory back to the host. I find this difficult to explain as well.

Thanks,
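One plausible explanation, consistent with the asterisked note at the top of this thread: kernel launches are asynchronous, so Sleep() on the host does not wait for the GPU at all. The queued launches may be submitted by the driver as a single batch, and the failure then surfaces at the next synchronizing call, which is the copy of device memory back to the host. A sketch of the same loop (reusing the poster’s names, which are not defined here) with an explicit synchronization after each chunk:

int iteration_size = 10000;

for (i = 0; i < 1 + row_list_length / iteration_size; i++)
{
    split_gpu<<<1 + cs_length / 256, 256>>>(dev_data, dev_row_list + i * iteration_size,
                                            dev_cs, row_length, iteration_size,
                                            cs_length, dev_count);
    // Block the host until this chunk has actually finished; without this,
    // Sleep() only delays the queuing of the next launch, not its execution.
    cudaDeviceSynchronize();
}

Each chunk still has to finish within the timeout on its own, so iteration_size must also be kept small enough for the slowest target GPU.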

After many weeks of Googling, going through a hundred threads, and trying many of the proposed solutions (some of them ridiculous, like reformatting or replacing hardware), I found ONE POST that suggested what I can now confirm fixes this problem. Not “it hasn’t happened yet, keeping my fingers crossed”, but actually fixed.

THE SOLUTION:

Nvidia Control Panel → 3D Settings → Set PhysX configuration → Select a Physx processor → choose your graphics card instead of leaving it on auto-select.

I just signed up to say this is genius, and I hope it works, because this problem has been driving me insane for months now.

Took a while, but the fail message came back… tried some other things.

We’ll see what happens now…

Hi guys,
I just registered to share my experience with this issue.
“NVIDIA display drivers have stopped responding” started appearing for me early this week, after I did two things:

  • I upgraded my NVIDIA drivers to 285.62 WHQL
  • I installed Adobe After Effects CS5.5

I got rid of the problem: I rolled back to my previous drivers (280.26) and uninstalled all the unwanted bloatware that Adobe installed (“Adobe AIR”, “Adobe Media Player”, and maybe one other; I apologize for not being more precise), and I went from six to eight freezes per day to none.
It may be worth looking into.

Also, for the people who, like me, were losing renders that their 3D application was calculating: I switched the display settings in mine (Cinema 4D) from OpenGL to software. This made no difference to render speed, but when the NVIDIA display driver problem occurred (before I fixed it), at least it no longer interrupted the rendering.

I just signed up to THANK YOU THANK YOU THANK YOU!!! I can confirm that the fix worked on my system (i7 2600k with nVidia 560.) It’s been 3 days without my screen going black and then the “nvidia display drivers have stopped responding” thing. I can safely say the problem has been solved. Once again SilentBob420BMFJ, you get my sincerest gratitude. This issue was driving me insane.

I’m having the same problem.
I use Windows 7 Ultimate x86 with an NVIDIA GeForce 8600 GT.
My computer works normally for about 3–5 minutes and then the graphics crash.

I think I tried the PhysX configuration settings and it didn’t work, though I’m not sure; I’ll try it again today.
I also tried uninstalling my monitor adapter and reinstalling it, but it still didn’t work.
Is this a software problem or a hardware problem? Should I buy a new graphics card?


OK, here is yet another twist to the problem. I have a 590; I can select GPU A, GPU B, or CPU in the control panel solution. None of them fixes it. Any ideas?


Shame on you, NVIDIA. It’s not a joke for people like me to spend $500.00 on one of your products (GTX 580) and to have it continually fail due to this driver issue. This is a considerable investment for many of us. I, unlike many of your other customers who have posted here and all across the web, use your products to make my livelihood as opposed to gaming (with all respect to you fine folks in the gaming community). As an editor and motion-graphics designer, every crash of your driver impacts my bottom line. Every crash loses me my most recent version of a project that I’m working on for a client. And this issue has been dragging on for how long now? At least have the decency to feign concern and to make us believe you’re seriously addressing the problem. Listen to all of these voices across the web. But no, you’re not listening. Instead you have chosen silence, or to blame Microsoft and Windows 7. Your response to this problem has been inexplicable and inexcusable. Amazingly enough, your driver has given me enough time to write this without crashing! Will wonders never cease.

I was able to fix the problem; however, I had a virus, k32.exe, which (at least the version I had) was a Bitcoin miner. For some reason this virus actually causes GPU overclocking, among other nasty things it was probably doing, like stealing my money and secret cooking recipes. The odd thing was that the virus came with a help file which explained that it was a GPU overclocker. As soon as I removed k32.exe and the four folders of junk it created on my drive (which Avast completely ignored and politely made firewall exceptions for, even though my security settings are at maximum), the error went away.

I would imagine that this same error could be caused by a variety of things, but GPU overclocking seems to be one possibility worth looking at, whether it comes from a virus or from some sort of system tweak. Hope that helps someone. And if you see k32.exe (or anything that says “bitcoin miner” in a system search) running, get rid of it; your virus protection may not find it on its own.

Using i7 950 / Windows 7 / GTX 470

GTX 460 running on Windows 7 64-bit with the same issue. I tried everything I found online with no help, and almost every driver NVIDIA offered, but no fix. My most recent setup has been running steadily with no issue for a few days now. I know I should only change one variable at a time and test it, but I don’t have that kind of time. Anyway, I made two changes that seem to have fixed my issue:
1. I downgraded the driver to 266.58.
2. In the NVIDIA Control Panel, under Manage 3D Settings, on the Program Settings tab, I selected Microsoft Internet Explorer and set the power management mode to maximum performance.

I have a strong feeling it was number 2 that fixed my issue, but I did not test to confirm, as I am happy to have a stable PC.

I would like to remind everyone that this thread is in a forum related to CUDA development on Windows. If you’re not a CUDA developer, then my advice in the original post on how to change TDR settings will not be valuable to you. It was meant to help those who intentionally want to let their CUDA kernels run for long periods of time on their display devices at the expense of interactivity.

For users of graphical applications, of course, this advice is neither applicable nor valuable: for them, the TDR serves a valuable purpose, which is to prevent their system from becoming unresponsive should something go wrong. Unfortunately, the occurrence of a TDR alone is not sufficient information to know what went wrong, only that something did. For those of you to whom this applies, I encourage you to head over to the graphics-related NVIDIA forums and discuss your issues there. Important information will include your exact computer and GPU model, your exact driver version, what you were doing at the time of the TDR, whether you see any pattern if you’ve had repeated TDRs, etc.

Hope this helps. I’m closing this thread at this point since it’s gone quite far off topic from CUDA development.

Thanks,
Cliff