Computation crash = stuck at 574 MHz

Sometimes when my computations crash during debugging and a thrust memory exception occurs, my GPU becomes stuck at 574 MHz. Is there any way to get it “unstuck” without rebooting or forcing the driver to crash? I typically run computations on multiple GPUs at once.

These crashes can occur anywhere from 20 minutes to 12 hours into a computation, or never (oh, the joy of debugging!), so I’d like maximum performance at all times to identify exactly what is happening that causes every variable to blow up to infinity, crashes my computation, makes thrust throw an exception, and leaves my GPU stuck at 574 MHz.
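For context, the failure surfaces as a thrust exception; a minimal sketch (with a hypothetical step functor standing in for the real computation) of catching it and logging which device and which CUDA error were involved:

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/system_error.h>
#include <cstdio>
#include <new>

// Hypothetical stand-in for one iteration of the real computation.
struct step_op {
    __host__ __device__ double operator()(double x) const { return x * 1.0001 + 1e-6; }
};

int main() {
    try {
        thrust::device_vector<double> state(1 << 20, 1.0);
        for (int iter = 0; iter < 1000; ++iter) {
            // May throw thrust::system_error on a launch or runtime failure.
            thrust::transform(state.begin(), state.end(), state.begin(), step_op());
        }
    } catch (const thrust::system_error& e) {
        int dev = -1;
        cudaGetDevice(&dev);
        std::fprintf(stderr, "thrust exception on device %d: %s (last CUDA error: %s)\n",
                     dev, e.what(), cudaGetErrorString(cudaGetLastError()));
        return 1;
    } catch (const std::bad_alloc& e) {
        std::fprintf(stderr, "device allocation failed: %s\n", e.what());
        return 1;
    }
    return 0;
}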

“or forcing the driver to crash?”

Good heavens, how do you manage this?

“Is there any way to get it “unstuck” without rebooting or forcing the driver to crash? I typically run computations on multiple GPUs at once.”

If you find one, kindly let me know.

I suppose that would be a grand RFE: an API that can reset the device, something close to a “shutdown and restart”.
I also think this is long overdue.

I do not want to burst your bubble, but I doubt you are going to find such an “unstucker”.
Personally, I would therefore focus on mechanisms and methods to identify the cause as thoroughly, and as quickly, as possible.
You may have to build a debug version: a version with extra redundancy for purposes of debugging, maintained predominantly for debugging.
For example, one option may be to have the debug version periodically push a trace (a checkpoint) into memory or onto disk, such that, if a crash occurs at 12 hours 1 second, the program can recommence from the 12-hour mark in a flash.
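A rough sketch of the idea, assuming a single state array on the device (the names, sizes, and checkpoint interval here are all hypothetical):

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Every so often, copy the device state back to the host and write it to
// disk, so that after a crash the run can resume near the failure point
// instead of starting over from hour zero.
void write_checkpoint(const double* d_state, size_t n, int iter) {
    std::vector<double> h_state(n);
    cudaMemcpy(h_state.data(), d_state, n * sizeof(double), cudaMemcpyDeviceToHost);
    char name[64];
    std::snprintf(name, sizeof(name), "checkpoint_%08d.bin", iter);
    if (std::FILE* f = std::fopen(name, "wb")) {
        std::fwrite(&iter, sizeof(iter), 1, f);
        std::fwrite(h_state.data(), sizeof(double), n, f);
        std::fclose(f);
    }
}

int main() {
    const size_t n = 1 << 20;
    double* d_state = nullptr;
    cudaMalloc(&d_state, n * sizeof(double));
    cudaMemset(d_state, 0, n * sizeof(double));
    const int checkpoint_interval = 1000;   // tune to taste
    for (int iter = 0; iter < 10000; ++iter) {
        // ... launch the real kernels here ...
        if (iter % checkpoint_interval == 0)
            write_checkpoint(d_state, n, iter);
    }
    cudaFree(d_state);
    return 0;
}

On restart, the program would read back the newest checkpoint file, copy it to the device, and resume from the stored iteration.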

“I typically run computations on multiple GPUs at once.”

How do you distribute the work?

I have found that, when I distribute the work more aggressively, such that kernels/devices work on smaller sub-problems or ‘work sets’ and more frequently retire completed work and accept new work, I generally arrive at errors more quickly, and a lot earlier in the program, should there be any errors left.
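For what it is worth, a minimal single-device sketch of the pattern (the kernel and sizes are hypothetical); the same idea extends to multiple GPUs by giving each device its own queue of work sets:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel that processes one small work set.
__global__ void process_chunk(double* data, size_t offset, size_t len) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) data[offset + i] = data[offset + i] * 1.0001 + 1e-6;
}

int main() {
    const size_t n = 1 << 24;
    const size_t chunk = 1 << 18;   // many small work sets instead of one big launch
    double* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(double));
    cudaMemset(d_data, 0, n * sizeof(double));
    for (size_t off = 0; off < n; off += chunk) {
        const size_t len = (n - off < chunk) ? (n - off) : chunk;
        process_chunk<<<(unsigned int)((len + 255) / 256), 256>>>(d_data, off, len);
        // Check after every work set, so a failure is reported close to the
        // launch that caused it rather than hours later.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "work set at offset %zu failed: %s\n",
                         off, cudaGetErrorString(err));
            break;
        }
    }
    cudaFree(d_data);
    return 0;
}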

If you’re using Windows, go to the Device Manager, disable the GPU and then enable it.

nvidia-smi -r is intended to reset a GPU, although it requires root privilege and cannot be used (AFAIK) to reset the “primary” GPU (which I think means a GPU driving a display).

Also, on Linux, if you do not have X loaded on the GPU in question, you can do something like

rmmod nvidia

After that, the next CUDA activity should force a driver reload, which should reset the GPU. (This method also requires root privilege.)

So, you cannot revive a crashed device from within your application…?
The closest solution is running a script from within your application?

This seems to champion a ‘make sure it never breaks; immediately fix it when it does’ approach.

From within the application:

cudaDeviceReset()
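For example, since cudaDeviceReset() acts only on the current device, a multi-GPU application would need something along these lines (a sketch; whether it revives a badly wedged card is another question):

#include <cuda_runtime.h>
#include <cstdio>

// Sketch: after a failure, reset every device the application has touched.
// cudaDeviceReset() affects the current device only, so select each one first.
void reset_all_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return;
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        cudaError_t err = cudaDeviceReset();
        std::fprintf(stderr, "device %d reset: %s\n", d, cudaGetErrorString(err));
    }
}

int main() {
    // ... run the computation; on a caught exception or error, call:
    reset_all_devices();
    return 0;
}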

I have to (re)check that. If I am not mistaken, I have been told that cudaDeviceReset only resets the context; cudaDeviceReset is insufficient to revive a device with ‘mad-card-disease’.

But apart from that, I would think that the reset is only half of the problem/solution.
What about the monitoring?

Do (compute) cards (running complex code) enter a state of perpetual insanity more often than not, or hardly ever?
Is blade613x’s case really such an uncommon one?
What about the context of servers/clusters?
What if blade613x had not picked this up, and it had occurred in the field?

Since I don’t know what mad-card-disease is (or perpetual insanity), I don’t know what is sufficient to revive it.

I’d be willing to bet that none of these methods works in every case. It may be that a reboot is necessary. Isn’t this true of PCs in general?

There are failures that occur in server clusters, with or without GPUs. The checkpoint/restart evolution long preceded GPUs. And sure, checkpointing is used in some cases for orderly shutdowns, but in many cases it is used to recover from that unexplained/unexpected crash.

I was just trying to offer some suggestions of things to try.

“Since I don’t know what mad-card-disease is (or perpetual insanity)”

Simply what blade613x, and surely others, have experienced: the distinct case where, should your device hit a bug that you are not aware of, it may crash and become unstable to the point that nothing less than a reset will revive it (and you may not even know that a device is ‘down’).

"I was just trying to offer some suggestions of things to try. "

a) And it was taken in no other way.
b) No offense intended.
c) Your input is appreciated.

My view is simply that I perceive the matter to be sufficiently common to warrant further investigation (by NVIDIA).