Process receives realtime signal 34 during exit. Nvidia driver to blame?

tsondergaard · December 3, 2014, 9:53pm

We are observing processes that receive a totally unexpected realtime signal 34 during shutdown. I set up a signal handler using sigaction() with SA_SIGINFO that prints information about the sender of the signal and then calls abort(). With this handler I can see that the signal is coming from the process itself, and the stack trace I get looks like this:

(gdb) bt
#0 0x0000003ee5e32925 in raise () from /lib64/libc.so.6
#1 0x0000003ee5e34105 in abort () from /lib64/libc.so.6
#2 0x00007ff36b66e98f in qtutility::qtutil::diagnoseRtSignal (sig=Unhandled dwarf expression opcode 0xf3
)
at /home/jenkins/workspace/personal-ts-7.0-all-tests-2/arch/rhel6_x86_64_ev6/qtutility/src/qtutil.cc:723
#3 0x0000003463eb3e33 in ?? () from /usr/lib64/libGL.so.1
#4 0x0000003463eb4890 in ?? () from /usr/lib64/libGL.so.1
#5 <signal handler called>
#6 0x0000003ee5ee53c9 in syscall () from /lib64/libc.so.6
#7 0x0000003463eb4a05 in ?? () from /usr/lib64/libGL.so.1
#8 0x0000003463eb4e8f in ?? () from /usr/lib64/libGL.so.1
#9 0x0000003463eb50cc in ?? () from /usr/lib64/libGL.so.1
#10 0x0000003463eb51ca in ?? () from /usr/lib64/libGL.so.1
#11 0x0000003463e900f5 in ?? () from /usr/lib64/libGL.so.1
#12 0x0000003ee560ebac in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#13 0x0000003ee5e35e22 in exit () from /lib64/libc.so.6
#14 0x0000003ee5e1ed24 in __libc_start_main () from /lib64/libc.so.6
#15 0x0000000000409ab1 in _start ()

As you can see, the program is executing in the NVIDIA driver when it receives the signal. Based on this I googled a bit and found this WebKit bug report, which is also about an unexpected signal 34 being received. The bug report attributes this to a bug in the NVIDIA driver.

https://bugs.webkit.org/show_bug.cgi?id=101614#c8

This problem is observed on a VM on an ESXi 5.5 hypervisor where the VM is assigned an NVIDIA Quadro K2000 GPU via passthrough (vDGA style). It happens when the VM (and the ESXi host) is under heavy load.
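For anyone who wants to try the same diagnosis, here is a minimal sketch of the kind of handler described above. It is not the actual diagnoseRtSignal() code from the backtrace; the names format_sender, diagnose_rt_signal and install_diagnostic_handler are made up for illustration. It prints si_code, si_pid and si_uid from the siginfo_t and then aborts. When si_pid equals the process's own PID, the signal was sent from within the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: format the sender details of a signal into buf.
 * Returns the value of snprintf (length that would have been written). */
static int format_sender(char *buf, size_t len, int sig, const siginfo_t *info)
{
    return snprintf(buf, len, "signal %d: si_code=%d sender pid=%ld uid=%ld\n",
                    sig, info->si_code, (long)info->si_pid, (long)info->si_uid);
}

/* Diagnostic handler: report who sent the signal, then abort to get a core
 * dump and backtrace. snprintf is not async-signal-safe; that is tolerated
 * here only because the process is about to abort anyway. */
static void diagnose_rt_signal(int sig, siginfo_t *info, void *ucontext)
{
    char buf[128];
    (void)ucontext;

    int n = format_sender(buf, sizeof buf, sig, info);
    if (n > 0) {
        /* Clamp in case the message was truncated. */
        size_t out = (size_t)n < sizeof buf ? (size_t)n : sizeof buf - 1;
        ssize_t rc = write(STDERR_FILENO, buf, out);
        (void)rc;
    }
    abort();
}

/* Install the diagnostic handler for one signal, e.g. SIGRTMIN.
 * Returns 0 on success, -1 on failure (as sigaction does). */
static int install_diagnostic_handler(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = diagnose_rt_signal;
    sa.sa_flags = SA_SIGINFO;   /* request the three-argument handler form */
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

Calling install_diagnostic_handler(SIGRTMIN) early in main() would cover signal 34 on a typical glibc system, where SIGRTMIN is 34.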

I’m addressing this forum in the hope that an NVIDIA driver developer can tell me whether the NVIDIA driver could be the culprit and what I can do about it, if that is indeed the case.

Best regards,
Thomas

tsondergaard · December 3, 2014, 9:57pm

http://spear.medical-insight.com/~ts/nvidia-bug-report.log.gz from the VM

aplattner · January 8, 2015, 4:03pm

Hi tsondergaard,

The NVIDIA driver does use real-time signals in order to synchronize threads in certain circumstances. It’s possible that the driver is failing to correctly handle these signals in some threads. I’ve filed bug 1594134 for further investigation. In the meantime, can you work around the problem by modifying your signal handler to not call abort()?
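A rough sketch of what that workaround could look like: a handler that only counts the stray signal and returns, instead of aborting. The names stray_rt_signals, tolerate_rt_signal and install_tolerant_handler are hypothetical. Whether installing a handler like this interferes with the driver's own use of the signal is not established in this thread, so treat it as an experiment rather than a fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Number of stray realtime signals seen so far. sig_atomic_t is the one
 * integer type that is safe to modify from inside a signal handler. */
static volatile sig_atomic_t stray_rt_signals = 0;

/* Tolerant handler: record the signal and return, so the process keeps
 * running (the default action for an unhandled realtime signal is to
 * terminate the process). */
static void tolerate_rt_signal(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)info;
    (void)ucontext;
    stray_rt_signals++;
}

/* Install the tolerant handler for one signal, e.g. SIGRTMIN.
 * Returns 0 on success, -1 on failure (as sigaction does). */
static int install_tolerant_handler(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = tolerate_rt_signal;
    /* SA_RESTART lets interrupted system calls resume where possible. */
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

The counter can be checked and logged at exit from normal (non-handler) code, which keeps the handler itself async-signal-safe.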

tsondergaard · June 9, 2016, 6:48pm

Hi Aaron,

We are still affected by this issue. I’ve just upgraded to the latest long-lived driver 361.45.11 and we are still seeing signal 34 errors occasionally. The bug you mentioned - bug 1594134 - is there any news on that?

Btw is the bug tracker public?

Thanks,
Thomas Sondergaard

aplattner · June 9, 2016, 8:15pm

The bug is still open. It would probably help the investigation if you could provide a test that reliably reproduces the problem, if possible.

The bug tracker is not public, sorry.

tsondergaard · June 24, 2016, 9:49am

Further experimentation has revealed more details:

I have only been able to reproduce it in a Qt application running under one specific configuration - changing any one of the following details makes the problem go away:

- Running under the Froglogic Squish GUI testing toolkit
- One particular system: a CentOS 6 x86_64 VM running on ESXi 5.5 on a dual-socket Xeon E5-2650 v2 system with an NVIDIA Quadro K2000 GPU

I have now upgraded the OS in the guest VM from CentOS 6 to CentOS 7 and I haven’t seen the problem since. We’ve run with the new CentOS 7 configuration for a week now, and we would normally have seen the signal 34 (SIGRTMIN) issue several times already. This is a good enough solution/workaround for me.

tsondergaard · June 24, 2016, 1:52pm

I spoke too soon. It just happened again. The upgrade to CentOS 7 doesn’t fix the problem, but it does seem to happen much less frequently.