Deadlock in call to glXMakeCurrent with 361.42

Hi,

Our application normally runs on driver 331.20 with a GT630, a GT430 or a 9500GT.

To support K1200 we recently upgraded the driver to 361.42. Everything is fine except that two threads created by our application will eventually deadlock on calls to glXMakeCurrent.

When the problem is reproduced under GDB, we find each thread is calling pthread_mutex_lock in libpthread from a return address located in libGL. We determined that the return address was in libGL using /proc/<pid>/maps. Looking at the mutex structures, we can see that each mutex is owned by the other thread, hence the deadlock. GDB can’t give us a source line for the mutexes, so I assume they are created by Nvidia code; if it helps, one is on the heap and the other isn’t. I would try to determine who created them using valgrind, but I can’t get that working at the moment.

The following grabs may not have been obtained from the same build and run of the program, so compare values with this in mind.

Thread 48 (Thread 5645):
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb64caafe in __lll_mutex_lock_wait () from /usr/share/mk7i-toolchain/lib/libpthread-2.5.1.so
#2  0xb64c6c57 in _L_mutex_lock_742 () from /usr/share/mk7i-toolchain/lib/libpthread-2.5.1.so
#3  0xb64c69cd in __pthread_mutex_lock (mutex=0x3378f60) at pthread_mutex_lock.c:64
#4  0xb79eba29 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

b7944000-b7a02000 r-xp 00000000 08:21 1819225    /usr/share/mk7pc/x11build_nv/lib/libGL.so.361.42

ldd mk7i | grep libGL
        libGLU.so.1 => /usr/share/mk7i-toolchain/lib/libGLU.so.1 (0xb76e1000)
        libGL.so.1 => /usr/share/mk7i-toolchain/lib/libGL.so.1 (0xb7681000)
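
The check that a return address falls inside the libGL mapping can also be done mechanically. Here is a minimal sketch in C, assuming the usual Linux maps line layout (hexadecimal start-end range at the start of the line); the helper name addr_in_mapping is made up for illustration:

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if addr lies inside the address range
 * at the start of one maps line, 0 if it does not, and -1 if the line
 * cannot be parsed. The range is treated as half-open: [start, end). */
int addr_in_mapping(const char *maps_line, unsigned long addr)
{
    unsigned long start, end;

    if (sscanf(maps_line, "%lx-%lx", &start, &end) != 2)
        return -1;
    return addr >= start && addr < end;
}
```

Passing the libGL mapping line and the frame #4 address 0xb79eba29 from the grabs above gives a hit, subject to the caveat that the grabs may come from different runs.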

The rest of the backtrace from our code into Nvidia code is not available using GDB’s backtrace command alone, presumably because libGL is built without frame pointers. By dumping the stack as an array of words, I found in each case a return address in our code just after a call to glXMakeCurrent, so the deadlock always involves our code calling glXMakeCurrent. This is confirmed by adding a mutex around calls to glXMakeCurrent so that only one thread at a time can call into it; this prevents the issue, but we’d prefer not to use a workaround permanently.
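
For anyone wanting to try the same workaround, this is roughly its shape. It is only a sketch, not our actual code: the names are invented, and the real glXMakeCurrent call is replaced by a callback so the example builds without GLX. The stub and demo function exist only to show that calls never overlap.

```c
#include <pthread.h>
#include <stddef.h>

/* One process-wide lock shared by every make-current call. */
static pthread_mutex_t make_current_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical callback type: a real body would call
 * glXMakeCurrent(dpy, drawable, ctx) and return its result. */
typedef int (*make_current_fn)(void *arg);

/* Runs fn(arg) while holding the lock, so at most one thread is inside
 * the make-current call at any time. Returns whatever fn returns. */
int serialized_make_current(make_current_fn fn, void *arg)
{
    int result;

    pthread_mutex_lock(&make_current_lock);
    result = fn(arg);
    pthread_mutex_unlock(&make_current_lock);
    return result;
}

/* Demo only: a stub standing in for the real GLX call. Both counters
 * are touched only while make_current_lock is held. */
static int inside;   /* threads currently inside the stub */
static int overlaps; /* times the stub found another thread inside */

static int stub_make_current(void *arg)
{
    (void)arg;
    if (++inside != 1)
        overlaps++;
    --inside;
    return 1; /* glXMakeCurrent returns True on success */
}

static void *worker(void *arg)
{
    int calls = *(const int *)arg;

    for (int i = 0; i < calls; i++)
        serialized_make_current(stub_make_current, NULL);
    return NULL;
}

/* Starts up to 16 threads, each making `calls` serialized calls, and
 * returns the number of overlapping calls observed. */
int demo_overlaps(int nthreads, int calls)
{
    pthread_t tid[16];
    int started = 0;

    if (nthreads > 16)
        nthreads = 16;
    inside = 0;
    overlaps = 0;
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, worker, &calls) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return overlaps;
}
```

A single global lock is the bluntest option: it also serializes threads that use unrelated contexts, which is one reason we would rather not keep it.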

So the investigation is now focused on showing whether there’s a problem in our installation and/or usage of GLX, or whether the Nvidia driver has a bug. To get the ball rolling I’ve attached the required nvidia-bug-report, obtained at the moment the program was deadlocked. Can you advise us what else we can collect? For example, since it involves glXMakeCurrent, maybe the output of apitrace (apitrace.github.io) or Nvidia’s own API debugger is required.

Miscellaneous information:
• Happens when everything else is kept constant and only the driver is changed
• Reproducible on these cards (note: we only tried to reproduce it on these cards): GT630 and K1200
• Reproducible on all the drivers we’ve tried that support K1200 (note: we only tried to reproduce it on these versions): 349.16 (the earliest to support K1200), 361.42, 367.27

nvidia-bug-report.log.gz (45.8 KB)

[Deleted some debug output that may have contained company sensitive information. Can send such information via email etc.]

Have reproduced the problem on 367.35

Found the problem!

A thread was making successive calls to glXMakeCurrent() without a glXMakeCurrent(NULL) in between. This violates this part of the GLX specification:

“Only one rendering context may be in use, or current, for a particular thread at a given time”

glXMakeCurrent() didn’t return an error, though. I wonder if it should?
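
Since the driver stays silent, an application-side debug check is one way to catch this kind of misuse. The sketch below is an illustration under assumptions, not GLX behaviour: it records, per thread, which context the application last made current and rejects a switch to a different context unless the previous one was released first. Contexts are modeled as opaque pointers, and the names checked_make_current and tracked_ctx are hypothetical.

```c
#include <stddef.h>

/* Context this thread last made current through the wrapper
 * (NULL means none). Each thread has its own copy. */
static _Thread_local const void *tracked_ctx = NULL;

/* Hypothetical guard around a make-current call.
 * Returns 1 if the switch is allowed, 0 if this thread still has a
 * different context current and did not release it first. */
int checked_make_current(const void *ctx)
{
    if (ctx != NULL && tracked_ctx != NULL && ctx != tracked_ctx)
        return 0; /* missing release in between */

    /* The real glXMakeCurrent(dpy, drawable, ctx) call would go here;
     * tracked_ctx should only be updated if that call succeeds. */
    tracked_ctx = ctx;
    return 1;
}
```

In a debug build the 0 return could be turned into an assertion failure, so the offending call site shows up in a backtrace instead of surfacing later as a deadlock.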