Recovering after a TDR event

I have an application which runs at Windows start-up and uses the GPU (via CUDA) when it is given a “task” to do. My application contains only short kernels and I’m fairly certain it never causes timeouts itself.

However, if a third-party application causes a timeout and the TDR mechanism restarts the video driver then I find that my program is no longer able to perform any CUDA operations - I have to restart my application.

What do I need to do to get CUDA working again without a complete application restart?

Between “tasks” I already free all CUDA resources and I’m pretty sure that every thread that uses any CUDA function calls cudaThreadExit() either explicitly or implicitly (by the thread itself terminating). Yet even if the TDR event occurs between “tasks” my application will still fail when it tries to start the next task.

I am not clear what exactly the situation is. A TDR event is a very intrusive action on the part of the operating system, and therefore leads to destruction of the current CUDA context. But the driver as such recovers (sometimes this can take a few seconds on Windows), so if a third-party application triggers a TDR, you should be able to run a CUDA application later without issues. Is this a situation where the third-party app runs concurrently with your own CUDA application?

Since a TDR leads to the OS “yanking out the floor” from underneath the driver, there is nothing really one can do at application level to protect against TDRs triggered by third parties. The basic goal should be to avoid TDRs, which, in practical terms, means one of the following:

(1) Don’t run apps that trigger TDRs

(2) Increase the TDR time-out limit (the necessary steps are OS specific; on Windows this is a registry setting, see the sketch after this list; it obviously means the GUI can be frozen for longer periods of time)

(3) Don’t run CUDA apps on GPUs that also serve the operating system’s GUI; use a dedicated GPU for compute only
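To expand on item (2): on Windows the relevant settings are registry values under the GraphicsDrivers key; changing them requires administrator rights and a reboot to take effect. A minimal sketch that merely reads the current TdrDelay value:

#include <stdio.h>
#include <windows.h>

int main()
{
	// TdrDelay is the number of seconds Windows waits before declaring the GPU
	// hung (default 2); TdrLevel in the same key controls whether TDR is enabled.
	DWORD delay = 0;
	DWORD size = sizeof(delay);
	LSTATUS status = RegGetValueW(HKEY_LOCAL_MACHINE,
		L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
		L"TdrDelay", RRF_RT_REG_DWORD, NULL, &delay, &size);

	if (status == ERROR_SUCCESS)
		printf("TdrDelay = %lu seconds\n", delay);
	else
		printf("TdrDelay is not set; the default of 2 seconds applies\n");

	return 0;
}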

My suggestion would be to detect the condition via a benign call, such as cudaGetDevice(), when starting a new task. If that call returns an error, then you can assume that some other application corrupted your CUDA context in the interregnum.

In that case, I would try calling cudaDeviceReset(). Then attempt to call cudaGetDevice again. If it does not return an error at that point, you should be able to proceed.
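Here is a minimal sketch of that detect-and-reset idea, assuming the runtime API and a toolkit recent enough to have cudaDeviceReset() (CUDA 4.0 or later); the function name tryRecoverAfterTdr() is just illustrative, not part of any CUDA API:

#include <stdio.h>

#include <cuda_runtime.h>

// Returns true if the CUDA runtime looks usable, possibly after a reset.
bool tryRecoverAfterTdr()
{
	int dev = 0;
	// Benign probe: if the context was destroyed by a TDR, this should fail.
	cudaError_t err = cudaGetDevice(&dev);
	if (err == cudaSuccess)
		return true;	// context looks healthy, proceed with the next task

	printf("cudaGetDevice failed (%s), attempting cudaDeviceReset\n", cudaGetErrorString(err));

	// Tear down whatever is left of this process's context on the device.
	if (cudaDeviceReset() != cudaSuccess)
		return false;

	// Probe again; the next real CUDA call will create a fresh context.
	return cudaGetDevice(&dev) == cudaSuccess;
}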

Alternatively, you could try calling cudaDeviceReset() at the completion of every task before the interregnum. Then if you don’t make any cuda calls until you attempt the next task, the first cuda call in the next task should re-establish your “new” context.

I don’t know if either of these will work. If they don’t then I can’t offer any ideas beyond the ones offered by njuffa already.

Basically, yes. My application is always running (it behaves almost as a service) but may spend long periods in a state where it is inactive and has no CUDA contexts allocated. A third-party application might run (and cause a TDR) during one of these periods of inactivity.

In theory, a third-party application might also run when my application is active. In this case I would ideally like to recognize that my CUDA context has been trashed, abort what I was doing and try to start over at some later time.

Essentially I’m writing an application for an end user and cannot control what else they might run on their PC but I would like to avoid it looking as though my application is at fault.

OK, so your app runs in the background for indefinite periods of time (your case may be similar to the GPU-accelerated Folding@Home app, for example). I am confused by the statement that there is “no CUDA context allocated”. If there is no CUDA context, how is the app affected by a TDR event? If a TDR event happens while there is no CUDA context, it should have no impact on CUDA contexts created subsequently, after the driver has recovered from the TDR.

This particular version of my application still uses CUDA 2.3, which lacks the cudaDeviceReset() function. I do have an updated version using CUDA 7.5 though, so I could investigate that.

Ideally I would prefer to avoid cudaDeviceReset() because it affects the entire process. Currently the part of my application that uses CUDA lives in a library, and it isn’t particularly friendly to call cudaDeviceReset() from within a library in case my process also uses other libraries that rely on CUDA. I could get into a situation where several libraries in my process all try to call cudaDeviceReset() after a TDR and end up in a vicious cycle of trashing each other’s contexts. This isn’t an insoluble problem, but the solution would be less than elegant.

After a TDR has occurred, if I call cuCtxCreate() then I get CUDA_ERROR_UNKNOWN (= 999).

I should say that I am mostly using the CUDA Runtime API but I briefly use the CUDA Driver API to query some statistics on memory availability. I always call cuCtxDetach() as soon as I have those statistics and then switch to the CUDA Runtime API.

I cannot speak to CUDA 2.3; that is such ancient history that I do not recall any specifics about it. If you are using the runtime API and the code has no call to cudaDeviceReset(), my best guess is that, once created, a CUDA context remains allocated in your app at all times, because, if I recall correctly, the runtime API has no other way of deallocating an existing context. That always-active context would be destroyed by a TDR, which then causes the problems you are encountering.

cudaDeviceReset() releases all GPU resources owned by the current process, so one needs to be careful with that in a multi-threaded application. However, other processes using CUDA should not be affected by such a call. I agree that sticking calls to cudaDeviceReset() inside library functions is probably not the way to go, just like one wouldn’t usually stick calls to exit() there. This kind of control should happen at the application level.

There are GPU accelerated apps like Folding@Home or BOINC that run in the background continuously, and these have been around for years and are used by hundreds of thousands of people. You may want to have a look at their sources to see how they are dealing with TDRs triggered by third parties.

In CUDA 2.3 there is a function called cudaThreadExit(). The documentation says:

Explicitly cleans up all runtime-related resources associated with the calling host thread. Any subsequent API call reinitializes the runtime. cudaThreadExit() is implicitly called on host thread exit.

I note that this function is deprecated under CUDA 7.5 and the documentation now says:

Note that this function is deprecated because its name does not reflect its behavior. Its functionality is identical to the non-deprecated function cudaDeviceReset(), which should be used instead.

So does this mean that if I want proper context creation and destruction semantics then I have to use the CUDA Driver API?

What about if I create a new thread before making any CUDA calls? What would happen to a CUDA Runtime API context associated with that thread if it terminated without calling cudaThreadExit() or cudaDeviceReset()?

Traditionally, what distinguishes a thread from a process as a unit of program execution is that the latter owns resources, e.g. memory allocations and file handles, while the former does not. So just like a memory allocation, a CUDA context is a resource owned by a process, and therefore shared by all threads within that process.

This fundamental design property does not change based on which API you use to interact with the CUDA context. So a call to cuCtxDestroy() will likewise de-allocate the CUDA context owned by the process, which affects all threads belonging to that process.
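A minimal sketch of what that shared ownership means in practice with the runtime API (this assumes the behaviour of CUDA 4.0 and later; under CUDA 2.3 the runtime still kept a separate context per host thread): an allocation made in one thread remains valid in another thread of the same process, because both share the process-wide context.

#include <stdio.h>
#include <windows.h>

#include <cuda_runtime.h>

static void *d_buf = 0;	// device pointer shared between the two threads

DWORD WINAPI AllocThread(LPVOID)
{
	// Allocating here initializes the context for device 0 in this process.
	cudaError_t err = cudaMalloc(&d_buf, 1 << 20);
	printf("cudaMalloc in thread A returned %i\n", err);
	return 0;
}

DWORD WINAPI UseThread(LPVOID)
{
	// Same process, same context: the allocation from thread A is usable here.
	cudaError_t err = cudaMemset(d_buf, 0, 1 << 20);
	printf("cudaMemset in thread B returned %i\n", err);
	err = cudaFree(d_buf);
	printf("cudaFree in thread B returned %i\n", err);
	return 0;
}

int main()
{
	HANDLE h = CreateThread(0, 0, AllocThread, 0, 0, 0);
	WaitForSingleObject(h, INFINITE);
	CloseHandle(h);
	h = CreateThread(0, 0, UseThread, 0, 0, 0);
	WaitForSingleObject(h, INFINITE);
	CloseHandle(h);
	return 0;
}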

I think I have come to the conclusion that the behaviour of CUDA contexts has changed dramatically since CUDA 2.3. I clearly have some studying to do.

As I recall, we changed numerous design details for CUDA 3.0, basically creating the modern CUDA that programmers use today. Context handling may have been one of the aspects that changed significantly, but it happened so long ago that I don’t remember any of the details.

I’ve made a quick little test program using only the CUDA 2.3 Driver API:

// TDR.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <cuda.h>

DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
	CUresult result;
	result = cuInit(0);
	printf("cuInit() returned %i\n", result);
	int count;
	result = cuDeviceGetCount(&count);
	printf("cuDeviceGetCount() returned %i, count = %i\n", result, count);
	CUdevice dev = 0;
	result = cuDeviceGet(&dev, 0);
	printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
	CUcontext ctx = 0;
	result = cuCtxCreate(&ctx, CU_CTX_BLOCKING_SYNC, dev);
	printf("cuCtxCreate() returned %i, ctx = %p\n", result, ctx);
	result = cuCtxDestroy(ctx);
	printf("cuCtxDestroy() returned %i\n", result);
	return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
	while(1)
	{
		int ch = _getch();

		switch(ch)
		{
		case 't':
			{
				HANDLE handle = CreateThread(0, 0, ThreadProc, 0, 0, 0);
				WaitForSingleObject(handle, INFINITE);
				CloseHandle(handle);
				continue;
			}
		case 'x':
			{
				break;
			}
		default:
			{
				continue;
			}
		}
		
		break;
	}

	return 0;
}

This is the output when I press ‘t’, ‘t’, cause a TDR and press ‘t’ and ‘x’:

cuInit() returned 0
cuDeviceGetCount() returned 0, count = 2
cuDeviceGet() returned 0, dev = 0
cuCtxCreate() returned 0, ctx = 00000000020096B0
cuCtxDestroy() returned 0
cuInit() returned 0
cuDeviceGetCount() returned 0, count = 2
cuDeviceGet() returned 0, dev = 0
cuCtxCreate() returned 0, ctx = 00000000020096B0
cuCtxDestroy() returned 0
cuInit() returned 0
cuDeviceGetCount() returned 0, count = 2
cuDeviceGet() returned 0, dev = 0
cuCtxCreate() returned 999, ctx = 0000000000000000
cuCtxDestroy() returned 1

I’m going to try the CUDA 2.3 Runtime API as well but I’m guessing it will be much the same.

Hopefully I’ll have more luck with CUDA 7.5.

Things seem to be just as bad under CUDA 7.5:

// TDR75.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <cuda.h>

int _tmain(int argc, _TCHAR* argv[])
{
	while(1)
	{
		int ch = _getch();

		printf("%c\n", ch);

		switch(ch)
		{
		case 'i':
			{
				CUresult result;
				result = cuInit(0);
				printf("cuInit() return %i\n", result);
				continue;
			}
		case 'r':
			{
				CUresult result;
				CUdevice dev = 0;
				result = cuDeviceGet(&dev, 0);
				printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
				result = cuDevicePrimaryCtxReset(dev);
				printf("cuDevicePrimaryCtxReset() returned %i\n", result);
				continue;
			}
		case 'p':
			{
				CUresult result;
				CUdevice dev = 0;
				result = cuDeviceGet(&dev, 0);
				printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
				CUcontext ctx = 0;
				result = cuDevicePrimaryCtxRetain(&ctx, dev);
				printf("cuDevicePrimaryCtxRetain() returned %i, ctx = %p\n", result, ctx);
				continue;
			}
		case 'f':
			{
				CUresult result;
				CUdevice dev = 0;
				result = cuDeviceGet(&dev, 0);
				printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
				result = cuDevicePrimaryCtxRelease(dev);
				printf("cuDevicePrimaryCtxRelease() returned %i\n", result);
				continue;
			}
		case 'q':
			{
				CUresult result;
				CUdevice dev = 0;
				unsigned int flags = 0;
				int active = 0;
				result = cuDeviceGet(&dev, 0);
				printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
				result = cuDevicePrimaryCtxGetState(dev, &flags, &active);
				printf("cuDevicePrimaryCtxGetState() returned %i, flags = %u, active = %i\n", result, dev, flags, active);
				continue;
			}
		case 't':
			{
				CUresult result;
				int count;
				result = cuDeviceGetCount(&count);
				printf("cuDeviceGetCount() returned %i, count = %i\n", result, count);
				CUdevice dev = 0;
				result = cuDeviceGet(&dev, 0);
				printf("cuDeviceGet() returned %i, dev = %i\n", result, dev);
				CUcontext ctx = 0;
				result = cuCtxCreate(&ctx, CU_CTX_BLOCKING_SYNC, dev);
				printf("cuCtxCreate() returned %i, ctx = %p\n", result, ctx);
				result = cuCtxDestroy(ctx);
				printf("cuCtxDestroy() returned %i\n", result);
				continue;
			}
		case 'x':
			{
				break;
			}
		default:
			{
				continue;
			}
		}
		
		break;
	}

	return 0;
}

The first attempt (with a TDR between ‘f’ and ‘r’):

i
cuInit() return 0
p
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxRetain() returned 0, ctx = 0000000000372A40
q
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxGetState() returned 0, flags = 0, active = 0
f
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxRelease() returned 0
r
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxReset() returned 0
p
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxRetain() returned 999, ctx = 0000000000000000
t
cuDeviceGetCount() returned 0, count = 2
cuDeviceGet() returned 0, dev = 0
cuCtxCreate() returned 999, ctx = 0000000000000000
cuCtxDestroy() returned 1

Another attempt (with a TDR between ‘i’ and ‘p’):

i
cuInit() return 0
p
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxRetain() returned 999, ctx = 0000000000000000
r
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxReset() returned 0
p
cuDeviceGet() returned 0, dev = 0
cuDevicePrimaryCtxRetain() returned 999, ctx = 0000000000000000

So basically, after my first call to cuInit(), if another application causes a TDR then my process is unable to ever use CUDA again.

I have also taken a look at Folding@Home as suggested by njuffa.

It would appear that each work unit provided by the server is run in a separate process. Separate GPUs are treated as different computing slots (although all of the CPUs and CPU cores are bundled into a single slot). Independent work units (and therefore processes) are assigned to each slot.

If a third-party application causes a TDR then the work unit process running on GPU 0 exits with an error. The work unit process running on GPU 1 appears to be unaffected (presumably because I don’t have a display connected to GPU 1). The client fetches a new work unit for GPU 0 and of course it is started in a new process.
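That per-process isolation looks like the most robust pattern for my situation too: if each GPU task runs in its own short-lived process, a TDR only takes down that child, and the long-running parent (which never touches CUDA itself) can simply retry the task later. A minimal sketch, assuming a hypothetical worker executable gpu_worker.exe that does the actual CUDA work and returns a nonzero exit code on failure:

#include <stdio.h>
#include <windows.h>

// Launches the worker, waits for it, and returns its exit code (-1 if it could not be launched).
int runGpuTaskInChildProcess(const wchar_t *taskArgs)
{
	wchar_t cmdLine[512];
	swprintf(cmdLine, 512, L"gpu_worker.exe %ls", taskArgs);	// hypothetical worker

	STARTUPINFOW si = { sizeof(si) };
	PROCESS_INFORMATION pi = { 0 };

	if (!CreateProcessW(NULL, cmdLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
		return -1;

	WaitForSingleObject(pi.hProcess, INFINITE);

	DWORD exitCode = 1;
	GetExitCodeProcess(pi.hProcess, &exitCode);
	CloseHandle(pi.hThread);
	CloseHandle(pi.hProcess);

	// If the child died because of a TDR, this process never touched CUDA and
	// can schedule a retry in a fresh child process at some later time.
	return (int)exitCode;
}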