Recent drivers cause application hangs, startup failures and compilation failures

Dear nvidia developers,

With recent drivers, including the most recent 332.20, issues have appeared that lead to compilation errors and to applications hanging or not starting at all.

More details on which applications and which other users are affected can be found at
https://bugs.gentoo.org/show_bug.cgi?id=487558

Attached are two bug reports: one with the most recent driver, where the issue appears, and one with the older 325.15 driver, where the issue doesn’t appear. Please note that aside from the driver version, the system, the software used and the configuration are exactly the same.
(Assuming they are attached; the upload just shows “scanning, please wait …” for ages. Pastebin copies, just in case:
Bug report from when the problem didn't occur - Pastebin.com (correct)
Bug report from when the problem occurred - Pastebin.com (error occurred)
)

Thanks in advance for investigating and fixing this.

nvidia-bug-report-faulty.log.gz (56.5 KB)
nvidia-bug-report-correct.log.gz (62.6 KB)

I have a similar issue with every 331.x driver on openSUSE 11.4 “Evergreen” (the long-term support project, so the kernel is 3.0.80), 32-bit, KDE 4.6, GTX 660 Ti.

I have not investigated deeply so far, but e.g. dolphin takes 1-2 minutes to start, sometimes plasma crashes when I log in and I end up on a black desktop with nothing on it even though programs seem to run, and in digikam batch mode loading or writing a file takes ~30 seconds each.

Everything up to and including 325.15 is OK, so I am using that at the moment.

Will attach bug report later when I have some time.

This seems to affect only certain distro (or kernel?) versions: I have a parallel install of openSUSE 12.3, and there are NO problems with the 331.x drivers there.

UPDATE:
report for 325.15 (all OK)
nvidia-bug-report.325.15.log.gz.jpg

install-report for installing 331.20:
file nvidia-installer.331.20.log

Strangely enough, after reinstalling 331.20 everything worked flawlessly after reboot & login. Report for that session:
nvidia-bug-report.331.20.log.gz.jpg

So I rebooted again and logged in, and it happened: plasma crashed, meaning the desktop was not shown (black) while the mouse cursor was still visible. Applications seemed to work, as I heard music from my mp3 player. Report for this: file “nvidia-bug-report.331.20-2.log.gz.jpg”

P.S. I added .jpg to all file names because earlier only jpg attachments were allowed here. Maybe that is not necessary anymore.
nvidia-installer.331.20.log (787 Bytes)

According to the Gentoo bug report, the problem seems to be that the signal mask is being propagated: “SIGCHLD signals are being masked, so zombies never get reaped by processes that expect to reap children manually (as opposed to ignored)”
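
To illustrate what "propagated" means here: a blocked-signal mask survives both fork() and exec(), so a mask that is set wrongly in one process ends up in every program it starts. The following is only a minimal Python sketch of that inheritance on a POSIX system, not a reproduction of the driver bug; the helper name `child_sees_sigchld_blocked` is made up for this example.

```python
import signal
import subprocess
import sys

# Child program: print whether SIGCHLD is in its own blocked-signal mask.
# pthread_sigmask(SIG_BLOCK, []) changes nothing and returns the current mask.
CHILD_CODE = (
    "import signal; "
    "print(signal.SIGCHLD in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)


def child_sees_sigchld_blocked(block_in_parent):
    """Start a child process and report whether SIGCHLD is blocked in it.

    The parent explicitly blocks or unblocks SIGCHLD first, so the result
    depends only on what the child inherits from the parent.
    """
    how = signal.SIG_BLOCK if block_in_parent else signal.SIG_UNBLOCK
    old_mask = signal.pthread_sigmask(how, [signal.SIGCHLD])
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        # Put the parent's original mask back, whatever happened.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("parent blocks SIGCHLD   -> child blocked:", child_sees_sigchld_blocked(True))
    print("parent unblocks SIGCHLD -> child blocked:", child_sees_sigchld_blocked(False))
```

A program that relies on a SIGCHLD handler to reap its children would never see the signal when started with such a mask, which fits the zombies described in the quote.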

It would be nice to see nvidia confirming (and hopefully being able to reproduce) this issue, so we know it is being worked on.

I am on Arch Linux x86_64 and have installed the driver from the official repositories. I am using the Xfce4 desktop environment. I have not seen any applications that fail to launch, but I do see zombies appearing. Furthermore, I normally close the Xfce4 terminal with ctrl-d. Now it seems to be random whether the terminal closes or not. Typing ‘exit’ to close the terminal suffers from the same randomness: sometimes it closes the terminal, other times it just hangs and I have to close it with alt-f4.

At the moment there is no mention of this on Arch boards, but this driver only made it into the official repos a short time ago.

I hope the nvidia devs are aware of the issue and looking into it.

Foxie, Please provide reproduction steps in detail.

Hello sandip, thanks for the reply.

I wish it were that easy to give you a way to reproduce it in 100% of cases. Unfortunately, the best I can give at this moment is a “happens in most cases” recipe, which is simple:

  1. Have a recent (> 319.49, with 331.20 it is still there but seems to happen less often) nvidia driver
  2. Have KDE running and use the KDE PIM suite (kmail, kaddressbook, …) which uses akonadi (PIM storage based on sqlite/mysql)
  3. Just do a regular boot and log in to a regular KDE session
  4. Open a KDE PIM application (e.g. kmail, kaddressbook, kontact …)

In Gentoo bug 487558 (“>=x11-drivers/nvidia-drivers-{331.17,319.49} causes processes to wait due to memory corruption”) you will find other issues that only happen with the nvidia drivers, including one with the Evolution mail client and bogofilter. I don’t use this software and hence can’t comment, but maybe these are easier for you to reproduce. You will also find information (such as the gcc/glibc/nvidia driver versions used) about other affected systems.
https://bugs.gentoo.org/show_bug.cgi?id=487558#c12 could also be helpful.

Signal handling also seems to be a useful angle for reproducing this: as per http://forums.gentoo.org/viewtopic-t-975106.html?sid=8b4f5670553424affe500aad0b28b764 there is a simple bash line that reproduces it easily in konsole (KDE’s terminal emulator; others might be affected as well): http://forums.gentoo.org/viewtopic-p-7437014.html?sid=f0956320ff1a7c17ad2f7acdeb4da85f#7437014

Sorry I can’t provide more, I hope you’ll be able to track this down and fix it.

Kind regards,

Christian

Arch forum post with some more info:
https://bbs.archlinux.org/viewtopic.php?pid=1350302

Hello!
I have the same problem but I got here after debugging the problem the hard way. I’m using ArchLinux 64, and my first symptom was with the debugger failing to debug anything.

I have tracked the issue to the process signal mask set from the nvidia SOs. The weirdest thing is that the sigprocmask is always changed from inside a call to pthread_create(). I’m suspecting a TLS issue, but I cannot be sure…

With the correct 325.15 libraries, the process sets the signal mask to 0x00000001 (correct).
But with the 331.20, the process sets the mask to some random value that looks a lot like a memory address 0x00007ffeaa5da000, for example.

Since the particular signal mask is different each time and it is inherited by child processes, the system behaves in very weird ways, always different.
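
For anyone who wants to check their own processes, here is a rough Python sketch for inspecting and decoding a blocked-signal mask. It assumes the usual Linux layout, where bit n-1 of the mask stands for signal n, and it reads the `SigBlk` field of `/proc/<pid>/status`; the function names are made up for this example, and I am assuming the values quoted above use the same layout.

```python
import signal


def decode_mask(mask):
    """Return the signal numbers whose bit is set (bit n-1 means signal n)."""
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def signal_names(signums):
    """Map signal numbers to names; unnamed (e.g. real-time) ones stay numeric."""
    names = []
    for num in signums:
        try:
            names.append(signal.Signals(num).name)
        except ValueError:
            names.append("signal %d" % num)
    return names


def blocked_signals(pid="self"):
    """Read and decode the SigBlk field of /proc/<pid>/status (Linux only)."""
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("SigBlk:"):
                return decode_mask(int(line.split()[1], 16))
    return []


if __name__ == "__main__":
    # The example value quoted in the post above.
    print(signal_names(decode_mask(0x00007FFEAA5DA000)))
    # The mask of this very process.
    print(signal_names(blocked_signals()))
```

Under that assumption, the example value 0x00007ffeaa5da000 has the bits for signals 17 and 28 set, which are SIGCHLD and SIGWINCH on x86_64 Linux, while the bit for SIGINT (2) is clear. A different garbage value would block a different set, which would explain why the symptoms vary from run to run.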

I hope this helps someone.

Regards

Rodrigo

Hello,
I can confirm this problem with driver 331.20 on openSUSE 12.3 64bit (Kernel 3.7.10-1.16-desktop, Nvidia 260GTX) and with d-bus and akonadi.
On another PC with openSUSE 12.3 64bit (same kernel but with a Nvidia 6800GT and Nvidia driver 304.108) everything works as expected.

Before the upgrade to 331.20 I ran 319.32 without any problems, but now the d-bus registration from akonadi fails and sometimes I have virtuoso-t zombies.

After a downgrade to 319.32 d-bus and akonadi work again flawlessly.

Anything new on this? This is really rather severe and annoying…

Same thing for me. Random segfaults, windows disappearing, processes zombifying.

Hi Rodrigo! Your debugging efforts seem very interesting. Did you also e-mail this to the support address to make sure it gets noticed upstream? It would be wonderful if this could finally be resolved somehow.

Hi, I am using a Gentoo distribution. The nvidia driver I am using
is version 319.49. If I upgrade to a newer version I suffer these
problems:

  • VMWare Workstation does not start virtual machines, giving the error
    “Cannot find a valid peer process to connect to”.
  • Some wine/crossover applications (e.g. DVDFab) do not start at all.
  • SMPlayer/mplayer hangs on quitting the application or, if I am watching
    TV, when I change channel.
  • Mono applications (e.g. Keepass) do not start at all.

A workaround for these problems is to start the applications from
the terminal (very boring).

Another issue with drivers > 319.49 is with SLI: it does not work,
giving the error “trouble accessing pci config space”.

Is there any chance of a future version of the driver working? :)

I apologize for my English.

Hi, I am using Arch Linux (x86_64).
I am having the same issue with 2 applications (so far):

  • Banshee: it just won’t open and stays stuck at…

** Running Mono with --debug **
[1 Debug 12:05:03.023] Bus.Session.RequestName (‘org.bansheeproject.Banshee’) replied with PrimaryOwner
[1 Info 12:05:03.028] Running Banshee 2.6.1: [ArchLinux (linux-gnu, x86_64) @ 2013-10-16 08:54:35 UTC]
[1 Debug 12:05:03.039] Initializing GTK

  • Osdlyrics: after a period of time it freezes, showing a black box instead of the transparent overlay above all the windows, and the last output is

Error: in function ol_dbus_get_uint: ol_utils_dbus.c[132]
call GetPosition failed: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

A workaround for banshee is using

sudo -u [myusername] banshee

or switching to the nvidia-304xx drivers, but then Chromium cannot use the nvidia card for hardware acceleration and throws:

NVIDIA: could not open the device file /dev/nvidia0 (Operation not permitted).

This can only be bypassed by using Chromium as root.

Thank you
Hope this helps in some way to fix this issue ;)

Hi, I am using Gentoo x86_64 with nvidia-drivers-331.20

Recently attempted an emerge -avuND world (to update my system) and the process kept hanging on sys-devel/gettext. Searched the Gentoo Bugzilla and came across this bug:
https://bugs.gentoo.org/show_bug.cgi?id=487558

After dropping out of X into a tty I am able to finish the borked emerge mentioned above. Also note that pressing ctrl+c in a terminal window does nothing (it should stop the currently running process).
nvidia-bug-report.log.gz is attached.
nvidia-bug-report.log.gz (184 KB)
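
Since the signal mask is inherited by child processes, one idea for a stopgap is a small launcher that clears the mask before starting the affected program. The Python sketch below is only that, an idea: nobody in this thread has confirmed that it avoids the hangs, and the name `run_with_clean_mask` is made up. It relies on the `preexec_fn` parameter of `subprocess.run`, which runs a callable in the child between fork() and exec().

```python
import signal
import subprocess
import sys


def _clear_signal_mask():
    # Runs in the child after fork() and before exec():
    # replace whatever mask was inherited with an empty one.
    signal.pthread_sigmask(signal.SIG_SETMASK, [])


def run_with_clean_mask(argv, **kwargs):
    """Run argv with no signals blocked, whatever the caller's own mask is."""
    return subprocess.run(argv, preexec_fn=_clear_signal_mask, **kwargs)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 clean_mask.py emerge -avuND world
    sys.exit(run_with_clean_mask(sys.argv[1:]).returncode)
```

Saved under a name such as clean_mask.py (also just an example), it would be called with the real command as its arguments. Note that the Python documentation warns that `preexec_fn` is not safe in multi-threaded programs, which is fine for a tiny launcher like this but not for general use.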

Filed a bug to track this issue: Bug 1431249: kde: kmail, kaddressbook not launching with akonadi error on 331.20

The issue no longer reproduces with our latest driver, which will be available soon.

Could this also account for a problem with SIGWINCH?

When will that be?

Me too, my experience with openSUSE 13.1 is being ruined by that driver.

I observed a similar problem on openSUSE (Access Denied). For me, the 319.82 drivers seem to have fixed it.