Synchronization hangs sporadically after kernel launch

I am experiencing a problem similar to the ones described here:
[url]https://devtalk.nvidia.com/default/topic/547899/kernels-timeout-or-hang-intermitently/[/url]
and here:
c - Cuda hangs on cudaDeviceSynchronize randomly - Stack Overflow

Our application processes a large number of small jobs (<1s GPU time/job), distributed on a cluster. Each GPU is shared by multiple workers which acquire exclusive access before each job by obtaining a lock. Each worker runs in its own thread and uses its own CUDA context. A job consists of copying data to the GPU, running multiple kernels and copying the data back. CUDA events are used to synchronize the kernel launches within a stream.
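To make the job structure concrete, here is a stripped-down sketch of what one job does in driver API terms (function and kernel names are placeholders, real error checking omitted):

[code]
#include <cuda.h>

/* simplified sketch of one job: copy the input to the device, launch the
   kernels asynchronously in one stream, record an event after the last
   launch and wait on it before copying the result back */
static void run_job(CUstream stream, CUfunction kernel,
                    CUdeviceptr d_buf, void *h_buf, size_t bytes)
{
    CUevent done;
    void *params[] = { &d_buf };

    cuMemcpyHtoDAsync(d_buf, h_buf, bytes, stream);

    /* the real pipeline launches several kernels here */
    cuLaunchKernel(kernel, 256, 1, 1, 128, 1, 1, 0, stream, params, NULL);

    cuEventCreate(&done, CU_EVENT_DEFAULT);
    cuEventRecord(done, stream);
    cuEventSynchronize(done);      /* <- this is the call that hangs */

    cuMemcpyDtoH(h_buf, d_buf, bytes);
    cuEventDestroy(done);
}
[/code]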

For a very small fraction of these jobs (roughly one in 10,000 to 100,000), the cuEventSynchronize() call used to synchronize the last kernel in the pipeline never returns and hangs indefinitely. While other workers are not actively launching kernels or copying data at that moment, they may be making other driver API calls such as cuCtxCreate(). As soon as one thread hangs, every other thread that calls into the driver API hangs as well. If I add a cuCtxSynchronize() call after every launch, the worker gets stuck there instead, usually after one specific kernel, which happens to have the longest runtime of all kernels in the pipeline.
This kernel does not contain any data-dependent loops or thread synchronization that could cause deadlocks. It does use warp shuffle instructions, but we already tried replacing them with (synchronized) shared memory accesses, without success.

Since this is a Linux setup without X, there is no launch timeout, so workers end up stuck indefinitely without any error. nvidia-smi reports 0% utilization in this state, suggesting that no kernel is actually running anymore.
When such a process is killed, all its jobs are distributed to other machines where they can be processed successfully.

The problem doesn’t occur if I reduce the number of worker threads to one per GPU (still two worker threads in total, as the machines have 2 GPUs). Together with the fact that other threads with other contexts are affected when one worker hangs, this suggests to me that something goes wrong when certain API calls happen concurrently.

I am not aware of any restrictions concerning two threads using different contexts. Can somebody make a definitive statement about that? Are there any driver API calls that must not happen concurrently (in distinct contexts)?

The problem has occurred with every combination of driver versions 340.32 and 346.59 and CUDA toolkits 5.5.22 and 7.0.28 on our Tesla K20Xm GPUs. cuda_memtest didn’t find any errors, unlike in the two threads linked above.

Could this be a driver/hardware issue? Is there any way to detect and recover from the problem when it happens? Since all driver API calls get stuck and there is no configurable timeout, we currently have to kill the process manually, which is not feasible in production.

There shouldn’t be any restrictions about using multiple contexts on a GPU, whether from independent threads or independent processes. (These contexts must obey various resource limits of course. For example, you don’t get to have twice as much global memory as the card can support just because you run two contexts…)

Not sure what cuda_memtest is. Do you mean cuda-memcheck? If so, did you run all subtool options?

It could be a driver/hardware issue. It’s impossible to say without a testable reproducer.

The only way I can think of to recover would be to have some sort of watchdog running. When the watchdog detects a hang, it could attempt to issue cudaDeviceReset() (or driver API equivalent), or it might need to signal some external mechanism that would either run nvidia-smi -r as root, or simply kill the offending process.

At what level that watchdog would need to be (either at the application level, or outside the application) I can’t say without more understanding of your environment.
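As a rough sketch of that watchdog idea (the heartbeat counters and check interval are hypothetical; the recovery policy would be whatever fits your environment):

[code]
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* each worker increments its heartbeat after every finished job;
   start this function in its own thread via pthread_create() */
#define NUM_WORKERS 6
static atomic_ulong heartbeat[NUM_WORKERS];

static void *watchdog(void *arg)
{
    unsigned long last[NUM_WORKERS] = { 0 };
    (void)arg;
    for (;;) {
        sleep(60);                                  /* check once a minute */
        for (int i = 0; i < NUM_WORKERS; ++i) {
            unsigned long now = atomic_load(&heartbeat[i]);
            if (now == last[i]) {
                fprintf(stderr, "worker %d appears stuck\n", i);
                /* policy decision: signal an external supervisor that can
                   run nvidia-smi -r, or simply exit so the job system
                   redistributes the work */
                exit(EXIT_FAILURE);
            }
            last[i] = now;
        }
    }
    return NULL;
}
[/code]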

Do you have ECC enabled on your K20Xm GPUs?

cuda_memtest is an open-source memory test program that uses the memtest86 patterns; it seems to be used by many developers.

I did try the cuda-memcheck tools and got no clues from them. (initcheck did report errors; however, I checked those, and the reported memory region is definitely initialized by cuMemset().)

From my point of view, it looks more like some issue within the driver, because all contexts created on a GPU get stuck when the problem occurs. I think that is something that should never happen, regardless of bugs in the application’s kernels. Interestingly, even nvidia-smi -r doesn’t work until the process is killed.

Watchdog functionality wouldn’t be too difficult to implement within our application, but I wouldn’t consider that an elegant solution (partly due to other constraints we have), especially when the driver might be the problem.

“From my point of view, it looks more like some issue within the driver, because all contexts created on a GPU get stuck when the problem occurs. I think that is something that should never happen, regardless of bugs in the application’s kernels. Interestingly, even nvidia-smi -r doesn’t work until the process is killed.”

all contexts, and smi, can get stuck for a number of reasons, not necessarily due to the driver alone
it may perhaps be in your handling of contexts for instance

check your device’s compute mode
and make sure not to have more than one context current per host thread

“check your device’s compute mode and make sure not to have more than one context current per host thread”

Since there is only one process per machine, default compute mode should suffice. We have exactly one context for each thread and contexts are never accessed by other threads.

“all contexts, and smi, can get stuck for a number of reasons”

Can you give me some of these reasons or point me to any documentation/other resources? My problem persists even if I comment out all kernel code and just launch empty kernels. As txbob said, there shouldn’t be any restrictions on what one can do concurrently, as long as it is done in separate threads and contexts, which leads me to the conclusion that (provided there is no bug in the driver) the application shouldn’t be able to break things in this way.

“Can you give me some of these reasons”

any possible reason that would suffice to explain an application bug
single-context applications can and do get stuck, for a number of reasons
hence, a multi-context application can get stuck for all the reasons a single-context application can get stuck, plus the additional possibilities pertaining to multiple contexts

it seems a logical fallacy to move (i can not say jump) to the conclusion that it must be the driver, when so few possibilities were exhausted
you seem to have a large application; any line can cause a violation of some sort
of course, you are free to in turn question my logic

sections 3.4 (Compute Modes, p. 66) and H.1 (Context, p. 205) of the cuda 7 programming guide seem relevant

the device compute mode can impact on the contexts

and you may have multiple contexts per host thread, but only 1 current context
and you need to maintain a proper context stack at all times
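a minimal sketch of the discipline meant here (driver api, hypothetical helpers):

[code]
#include <cuda.h>

/* either bind one context to a thread once, so that thread always has
   exactly one current context ... */
void bind_context_for_thread(CUcontext ctx)
{
    cuCtxSetCurrent(ctx);
}

/* ... or, if a thread must temporarily work in another context, keep
   the context stack balanced with matching push/pop calls */
void do_work_in_other_context(CUcontext other)
{
    CUcontext popped;

    cuCtxPushCurrent(other);
    /* driver API calls against 'other' go here */
    cuCtxPopCurrent(&popped);   /* restores the previous current context */
}
[/code]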

what is the number of nodes per your cluster?
what is the maximum number of threads per node at any time?
what is the maximum number of contexts per device at any time?

i presume the driver would fail memory allocation when contexts start to overlap - with emphasis on presume
i also do not know the max number of contexts that can exist per device

launching empty kernels seems like an excellent idea
perhaps you can now collect a stack trace to see how the application got stuck, and where it got stuck

“Do you have ECC enabled on your K20Xm GPUs?”

“single-context applications can and do get stuck, for a number of reasons”

Do you know about any conceptual or implementation bug which might lead to the symptoms I described above?

“it seems a logical fallacy to move (i can not say jump) to the conclusion that it must be the driver, when so few possibilities were exhausted”

I don’t get any error from the API and the bug persists even without any code in the kernel functions. Since there should be no way to deadlock the driver (if we assume that all constraints set by the API documentation are met), how would I be able to mess up in such a way that API calls in ALL contexts belonging to a device get stuck? There may be a bug in our application, but I believe that there has to be one in the driver too, if it’s possible to cause deadlocks in this way.

Thanks for pointing out the pages in the programming guide, but our thread/context management is much simpler anyway: there are a number of worker threads, each of which creates its own context at startup and keeps it current for the rest of its lifetime.
The number of worker threads/contexts per device is constant for the lifetime of the process and usually lies between 3 and 6, depending on the configuration.

“when contexts start to overlap”
What do you mean by that? Since we only have a handful of contexts, this doesn’t look like a problem for me.

I do have stack traces that show the process stuck in cuEventSynchronize(), but according to cuda-gdb no kernel is running at that point. Eventually, all other threads get stuck too when they make API calls like cuMemHostRegister()/cuMemHostUnregister() (in their own contexts).

@txbob, yes ECC is enabled and there have never been any errors on any of them

Probably best if you can provide a reproducer test case.

I’m curious about this too:

" CUDA events are used to synchronize the kernel launches within a stream."

Not sure what that means. Kernel launches within a stream should already be synchronous to each other.
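For illustration, a minimal sketch (kernel handles, arguments and launch dimensions are placeholders):

[code]
#include <cuda.h>

/* within one stream, k1 and k2 already execute in issue order,
   so no event is needed between them */
void launch_pipeline(CUstream s, CUfunction k1, CUfunction k2, void **args)
{
    cuLaunchKernel(k1, 64, 1, 1, 128, 1, 1, 0, s, args, NULL);
    cuLaunchKernel(k2, 64, 1, 1, 128, 1, 1, 0, s, args, NULL); /* waits for k1 */
}

/* events are typically used to order work across *different* streams */
void order_across_streams(CUstream a, CUstream b, CUfunction k, void **args)
{
    CUevent e;
    cuEventCreate(&e, CU_EVENT_DISABLE_TIMING);
    cuLaunchKernel(k, 64, 1, 1, 128, 1, 1, 0, a, args, NULL);
    cuEventRecord(e, a);
    cuStreamWaitEvent(b, e, 0);   /* b waits until a has reached the event */
    cuLaunchKernel(k, 64, 1, 1, 128, 1, 1, 0, b, args, NULL);
    cuEventDestroy(e);
}
[/code]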

if pictures are worth a thousand words, projects are worth 10,000

“Do you know about any conceptual or implementation bug which might lead to the symptoms I described above?”

i once, by accident, started deleting arrays on the device and the host before the device was done with them; manifested as the same
hence, there is one case for you
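a contrived sketch of that case (hypothetical names; the point is the missing synchronization before the buffers are released):

[code]
#include <cuda.h>

/* resources are released while asynchronously queued work that uses them
   may still be pending - the cuStreamSynchronize() is missing */
void premature_free(CUstream s, CUfunction k, CUdeviceptr d_buf,
                    void *h_pinned, size_t n)
{
    void *params[] = { &d_buf };

    cuLaunchKernel(k, 64, 1, 1, 128, 1, 1, 0, s, params, NULL);
    cuMemcpyDtoHAsync(h_pinned, d_buf, n, s);   /* queued, returns at once */

    /* BUG: no cuStreamSynchronize(s) before releasing the buffers that
       the queued kernel and copy still refer to */
    cuMemFree(d_buf);
    cuMemFreeHost(h_pinned);
}
[/code]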

txbob alludes to a stream race; this too may be a cause - attempting to synchronize on a void - no event, etc
there are other types of stream races too

what is the max number of apis that can be queued up at any time?

perhaps the problem is not on the device side then, but more on the host side
did you try valgrind
in some cases, it seems more powerful than memcheck, etc

“> when contexts start to overlap
What do you mean by that? Since we only have a handful of contexts, this doesn’t look like a problem for me.”

perhaps txbob can deliberate more
txbob already pointed out that contexts can not and do not multiply memory, and i concur
my understanding is that contexts are separate address spaces
if you then create multiple contexts per device, these may grow closer to each other
i suppose the driver would prevent overlap - i.e. prevent the sum of memory allocated across the contexts from exceeding total (device) memory

I already said that I only have a handful of contexts, which won’t exhaust the GPU’s memory. We track and monitor all allocations, and the behavior can be reproduced while less than 20% of the GPU’s total memory is in use. Also, if overallocation were a problem, I would expect to see errors from the API, not a deadlock.
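For illustration, a simplified sketch of that kind of check using cuMemGetInfo() (not our actual monitoring code):

[code]
#include <cuda.h>
#include <stdio.h>

/* query free/total device memory from within the current context to
   confirm that the combined footprint of all contexts stays low */
void log_device_memory(const char *tag)
{
    size_t free_bytes = 0, total_bytes = 0;

    if (cuMemGetInfo(&free_bytes, &total_bytes) == CUDA_SUCCESS) {
        printf("%s: %.1f%% of device memory in use\n", tag,
               100.0 * (double)(total_bytes - free_bytes) / (double)total_bytes);
    }
}
[/code]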

Unfortunately, it seems to be very hard to write a standalone reproducer.

One thing I noticed is that the problem seems to go away if I schedule all asynchronous operations on the default stream instead of my own streams (it still occurs if I use only one stream created with cuStreamCreate()).

We usually use one or more streams per context to run parts of our calculations concurrently, and all memcopies, memsets and kernel launches happen asynchronously in these streams (typically fewer than 10 operations per stream). Before the streams are created, there are some minor synchronous operations such as allocations, initialization of module globals, etc.
Even though I haven’t found anything about it in the driver API documentation, could there be any restrictions on mixing synchronous and asynchronous operations like that?
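To make the two stream variants concrete (sketch; names, sizes and launch dimensions are placeholders):

[code]
#include <cuda.h>

/* variant in which the hang has been observed: work queued in a stream
   created with cuStreamCreate() */
void job_own_stream(CUfunction k, CUdeviceptr d, void *h, size_t n, void **args)
{
    CUstream s;
    cuStreamCreate(&s, CU_STREAM_DEFAULT);
    cuMemcpyHtoDAsync(d, h, n, s);
    cuLaunchKernel(k, 64, 1, 1, 128, 1, 1, 0, s, args, NULL);
    cuMemcpyDtoHAsync(h, d, n, s);
    cuStreamSynchronize(s);
    cuStreamDestroy(s);
}

/* variant in which the problem seems to go away: the same work queued
   in the default (NULL) stream of the context */
void job_default_stream(CUfunction k, CUdeviceptr d, void *h, size_t n, void **args)
{
    cuMemcpyHtoDAsync(d, h, n, 0);
    cuLaunchKernel(k, 64, 1, 1, 128, 1, 1, 0, 0, args, NULL);
    cuMemcpyDtoHAsync(h, d, n, 0);
    cuCtxSynchronize();
}
[/code]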

“I already said that I only have a handful of contexts, which won’t exhaust the GPU’s memory. We track and monitor all allocations, and the behavior can be reproduced while less than 20% of the GPU’s total memory is in use. Also, if overallocation were a problem, I would expect to see errors from the API, not a deadlock.”

i attempted to make 2 points:
a) it should not be about the number of contexts, but total memory footprint, and whether different footprints potentially overlap
b) i do not know how the driver handles allocations across multiple contexts; i suspect the driver would i) attempt to prevent overlap, and ii) fail on new allocations implying overlap

again, i do not know the answers to these questions, as it is not documented
i merely ran with hypotheses; which, unless tested, merely remain hypotheses

i suppose one could set up test cases to evaluate these hypotheses

“One thing I noticed is that the problem seems to go away if I schedule all asynchronous operations on the default stream instead of my own streams (it still occurs if I use only one stream created with cuStreamCreate()).”

perhaps focus on this then
so, to do this, i take it that you remove your own stream specifications, and specify the stream as ‘0’ as part of all asynchronous calls then?

non-default streams normally relax certain constraints, etc generally in place with the default stream
the act of synchronization may become easier
in some cases, synchronization may be across streams, rather than per stream, with the default stream
my thoughts would be to ensure that stream synchronization is proper - the anti-thesis of stream races
ensure that all (stream) synchronization calls are ‘reachable’, and that it does not imply the ‘carrot in front of the donkey’s nose’ scenario
if a synchronization call is unreachable, the synchronization call too would run indefinitely
synchronization calls may become unreachable when a) the ‘event holder’ is destroyed prematurely, b) no event was recorded in the first place, c) the event is over-written, before it is reached
in the above, i use the term event in a general manner, not necessarily limited to cuda events only
all of this builds on the notion of a stream race - a race at the level of streams, as opposed to the level of threads, etc; just as proper synchronization may fail at the level of threads, proper synchronization may fail at the level of streams

another test may be to see if you can use your own stream specification, as opposed to the default stream, when stream calls/work are limited (from fewer than 10 to only 1)
and can you note the particular synchronization call that runs indefinitely - is it the same call, or does it change? and does its position in terms of preceding stream calls/work change or not?

I was finally able to reproduce the issue in a standalone test case, which you can find here: [url]https://gist.github.com/0xee/bf6b3d9ded7ebd574dad[/url]

So far, it has only happened with the default configuration set in the Makefile, but I’m running more tests on our cluster right now and will update if I can find out more.

It usually takes about one or two hours to run into the deadlock situation, but I suspect this can be sped up by some parameter tweaking.

The program starts the configured number of threads, which all create a context and do some nonsensical work in a loop. A monitor thread prints the current loop counter for each thread, so the program has deadlocked when the counters stop changing.
In this case, gdb shows that one of the threads tries to perform some driver API operation and all the others are stuck, apparently waiting on a lock inside the driver.

@little_jimmy, at first it looked like it was always a synchronization call, maybe because of the arrangement of the API calls in our code, but in the repro code it has already happened with every one of the functions called within the loop.

Can anyone see if there’s anything wrong with the code in my test case?

I’ve started a test now on a dual K40m system with gcc 4.8.4 and CUDA 7.0, RHEL 6.2.
I’ll update in a few hours with any observations.

Right now the counters are just incrementing.

are you using pinned memory with your asynchronous memory d>h transfers, my lord?

i scanned the code, but can not confirm this (i seem to confirm the contrary)

it is late (for me at least); i remotely hear my mind telling me something like:

i) non-pinned memory would render asynchronous memory transfers synchronous
ii) a worker thread must then be called in to assist with this, over and above your worker threads
iii) too many threads busy waiting may mean stream synchronization calls become unreachable, as the now synchronous asynchronous memory calls struggle to complete, causing build up and essentially causing deadlock - the worker thread necessary to complete the transfer can not commence, as there are too many other threads in the way

ask cuda maestro txbob to confirm or refute the argument

if the argument holds, consider pinned memory, or at least setting sync flags to yield rather than busy wait
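a minimal sketch of the 'yield rather than busy wait' suggestion in driver api terms (the flag choice shown is only an example):

[code]
#include <cuda.h>

/* the scheduling behaviour of blocked host threads is chosen when the
   context is created; an event can additionally be made blocking-sync */
void create_worker_context(CUcontext *ctx, CUdevice dev, CUevent *done)
{
    /* CU_CTX_SCHED_YIELD: spin-waiting host threads yield the CPU
       CU_CTX_SCHED_BLOCKING_SYNC: host threads sleep on a primitive instead */
    cuCtxCreate(ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);

    /* cuEventSynchronize() on this event blocks rather than busy waits */
    cuEventCreate(done, CU_EVENT_BLOCKING_SYNC);
}
[/code]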

I haven’t fully analyzed the code, but there is a call to cuMemHostRegister in there. That pins memory.
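For reference, a minimal sketch of that pattern (assuming a page-aligned host allocation):

[code]
#include <cuda.h>
#include <stdlib.h>
#include <unistd.h>

/* page-lock an existing host allocation so asynchronous copies involving
   it can overlap with other work; call cuMemHostUnregister() before
   freeing the memory again */
void *alloc_registered(size_t n)
{
    void *p = NULL;

    if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), n) != 0)
        return NULL;
    if (cuMemHostRegister(p, n, 0) != CUDA_SUCCESS) {
        free(p);
        return NULL;
    }
    return p;
}
[/code]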

yes, quite; there are other flaws with the argument too
a synchronous-rendered asynchronous call may merely mean the api does not return until the transfer is complete; it does not necessarily imply additional worker threads
i think i have successfully demonstrated the case why one best should not post when your mind is at its end

“In this case, gdb shows that one of the threads tries to perform some driver API operation and all the others are stuck, apparently waiting on a lock inside the driver”

so, are you running the test case in the debugger?
(as an observation, i now have reservations about running (extremely) lengthy test code in the debugger for too long, as the debugger seemingly builds and maintains some sort of record or trace; currently, that is my best explanation for the occurrences of cudbgDriverInternalError i have encountered)

can you note the particular API call; is it the same one every time, does it change?

also, is the timing of the error the same, or not - does it change, or does it occur more or less at the same point in time, each time?

thanks, @txbob
I’m currently running with CUDA 7.0, gcc 4.7 on opensuse 12.3

The transfers are (and should be) from pinned memory, but at least in our real application, the bug occurs in both cases, pinned or pageable.

“so, are you running the test case in the debugger?”
No, I just use gdb to attach to the process when the counters stop moving. I use gdb rather than cuda-gdb for that, to avoid the cudbgDriverInternalError I get nearly every time I attach with cuda-gdb. Running the program in the debugger would just take forever and, I think, provide no additional information.

On our cluster, I rarely see the counters go above ~10000, with an estimated median ‘deadlock value’ below 3000.

I just added a sample gdb log to the gist, showing the state of the program as it is stuck. (I also updated the Makefile so that g++ adds debug info.)
https://gist.github.com/0xee/bf6b3d9ded7ebd574dad#file-gdb_log_0