X hangs using 100% CPU, WAIT and mieq overflowing errors in logs

ahktenzero · March 19, 2013, 12:37am

ASUS P5K3-Deluxe
Core 2 Duo E6850 @ 3GHz
8GB RAM
Xorg 1.12.4
Linux 3.7.3 with Debian patches (custom config)
NVidia driver 313.26
Gainward GTX 560 Ti

This problem has been occurring for a while now, since November last year perhaps, I can’t remember exactly. After a random period (anywhere between a few hours and a week), X stops responding. The mouse cursor still moves but it doesn’t respond to keyboard input, and the display no longer updates other than the mouse cursor. It seems to mostly happen when using Firefox but I have had it happen doing other things. The X server uses 100% of one CPU core and has to be killed with SIGKILL. It works normally again after restarting until the next occurrence.

When I started getting the problem I was using a Gainward GTX 460. In late December it started getting really bad, repeatedly occurring after 2 hours or less, and as I was also getting frequent lockups and texture glitches playing games in Windows, I figured it was a hardware fault and RMA’d the card. The card was tested and found to be defective.

While I was waiting for the replacement card I used a spare one, an ASUS GeForce 5500FX. I had to downgrade to the 304.64 drivers as that card isn’t supported by the newer ones, and it ran stably for nearly a month.

I was sent a 560 Ti as replacement for the faulty card, which I installed last weekend along with the latest NVidia drivers. I had no problems until Friday evening when the same problem reoccurred. Over the weekend I spent some time in Linux and some in Windows playing games and had no problems. Today I had another lockup after about 13 hours of use.

Previously I tried a number of combinations of Linux kernel versions (3.2, 3.6, 3.7) and Nvidia drivers (304 series, 310 series, 313 series) but it didn’t make any difference.

From looking at other reports of this problem I’ve seen suggestions to disable the IOMMU by adding intel_iommu=off or amd_iommu=off to the boot command line. I will try that and see if it improves things.
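
For anyone trying the same thing, it is worth confirming after the reboot that the option actually reached the kernel. A minimal Python sketch, assuming a Linux system where /proc/cmdline holds the boot command line; the function names `iommu_settings` and `read_cmdline` are made up for illustration:

```python
def iommu_settings(cmdline: str) -> dict:
    """Return any intel_iommu / amd_iommu options found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("intel_iommu", "amd_iommu"):
            found[key] = value
    return found


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Example with a hypothetical command line (not taken from the bug report):
print(iommu_settings("root=/dev/sda1 ro intel_iommu=off quiet"))
```

On a live system, `iommu_settings(read_cmdline())` returning an empty dict would mean neither option was passed at boot.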

NVidia bug report is available here: http://yog-sothoth.mohorovi.cc/tmp/nvidia-bug-report.log.gz

Let me know if you need any more information.

ahktenzero · March 29, 2013, 3:22am

I’ve had another occurrence, this time with a 3.8 kernel and 313.26 drivers, after 3 days and 12:36 hours of uptime. It was triggered by opening a new browser tab from an external program. The browser was offscreen at the time; I don’t know if that makes any difference.

I’m going to try downgrading to the 304.84 drivers to see if those are any more stable. When I was running the older card using those drivers I had no lockups at all.

Bug report attached.
nvidia-bug-report.log.gz (55.8 KB)

ahktenzero · April 1, 2013, 10:49pm

Had another lockup with 304.84, after 22:16 of uptime. Same situation as the last occurrence.

Downgrading to 304.64.

Bug report attached
nvidia-bug-report.log.gz (61.7 KB)

ahktenzero · April 5, 2013, 11:37pm

So far I’ve had no lockups with 304.64, stable for 3 days 23:00. Though it’s taken over a week for it to occur before so I’m not considering this resolved yet.

I do have an error in my Xorg log as below from earlier today:

[311109.843] nvLock: client timed out, taking the lock.

The timestamp works out as 3 days 14:25. There might have been a tiny moment of unresponsiveness around that time but I was running a compile in the background amongst other things so it’s hard to say. I can’t find any other reports of this error which look relevant.
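
The conversion is simple arithmetic on the bracketed number, treating it as elapsed seconds the way the post does. A small Python sketch (the helper name `format_elapsed` is just for illustration):

```python
def format_elapsed(seconds: float) -> str:
    """Convert an elapsed-seconds log timestamp into 'D days HH:MM'."""
    total_minutes = int(seconds) // 60          # drop the fractional seconds
    days, rem = divmod(total_minutes, 24 * 60)  # whole days, leftover minutes
    hours, minutes = divmod(rem, 60)
    return f"{days} days {hours:02d}:{minutes:02d}"


print(format_elapsed(311109.843))  # 3 days 14:25
```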

One other thing I noticed was memory usage on my GPU had got up to 50%. Restarting my browser dropped it to 35%. I’ve added a widget to my statusbar to display GPU temp, fan speed and memory usage (taken from nvidia-smi). I doubt it’s related, the memory usage in the last couple of bug reports I sent is only 25% but I’ll keep an eye on it just in case. The temps in both are 30-31°C so I don’t think it’s overheating either.
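
For reference, a widget like this can be fed from nvidia-smi's CSV query mode. This is only a sketch of one way to do it, not necessarily how the widget above works: it assumes a driver whose nvidia-smi supports `--query-gpu` (older releases may not, and some cards report fan speed as unsupported), and the sample values in the last line are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "temperature.gpu,fan.speed,memory.used,memory.total"


def parse_gpu_status(csv_line: str) -> dict:
    """Parse one 'temp, fan, used, total' CSV line (no header, no units)."""
    temp, fan, used, total = (field.strip() for field in csv_line.split(","))
    return {
        "temp_c": int(temp),
        "fan_pct": int(fan),  # raises ValueError if the fan is not reported
        "mem_pct": round(100 * int(used) / int(total)),
    }


def read_gpu_status() -> dict:
    """Query the first GPU via nvidia-smi and parse its status line."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_status(out.splitlines()[0])


# Hypothetical sample line: 31 C, fan at 30%, 256 of 1024 MiB used.
print(parse_gpu_status("31, 30, 256, 1024"))
```

Keeping the parsing separate from the subprocess call makes it easy to log the values over time and compare them against the moment of a lockup.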

ahktenzero · April 5, 2013, 11:48pm

And 10 mins later I get another lockup. Bug report attached.
nvidia-bug-report.log.gz (61.6 KB)

ahktenzero · April 11, 2013, 1:23am

It’s happened again, same kernel (3.8.3) and NVidia driver (304.64) versions. This time it only lasted 10 hours. Same scenario as before, opening a new tab remotely.

GPU temperature was 32°C when it happened, fan at 30% so it’s not overheating.

As an experiment I rebooted into Windows (7, 64bit) last Friday evening and left it running all weekend to see if anything similar would happen. I didn’t do much web browsing but I did play Borderlands 2 for several multi-hour sessions without problems. If there were general instability issues caused by lack of power or cooling I’d expect them to show up during activities which stress the GPU like gaming.

I’ll have to try the new 319 beta drivers to see if those are any better. If anyone has any suggestions for extra debugging options I can turn on which might help narrow down the cause of this I’d appreciate it.

Bug report is attached.
nvidia-bug-report.log.gz (61.2 KB)

cryptor · April 11, 2013, 10:32pm

Hi Ahktenzero,

My error log looks superficially similar to yours.

Please let me know if you find a resolution to your problem.

Cryptor

ahktenzero · April 16, 2013, 5:06pm

I’ve had another occurrence, still on the 313.80 drivers, 3.8.3 kernel. Bug report is attached.

This time there are no WAIT lines in the X logs, only the miEQ overflow errors, which seems rather odd as I had thought the problem was due to the GPU not responding but the error which indicates that isn’t present.

One odd thing I’ve noticed is my GPU seems to be stuck on the highest performance setting since my last reboot. It’s set to use adaptive performance levels in nvidia-settings, and I’m not running anything GPU-intensive. As a result it’s about 10°C hotter than usual but still within safe limits (between 44-48°C). It was generally running on performance level 1 before.

The only thing which has changed since last time is I’ve swapped one of my CRTs for an LCD panel (my second CRT having failed over the weekend).

I’ve also got MSI disabled (I think I turned it off when trying to fix this problem with the previous card) so I might turn that back on and see if it makes any difference.

nvidia-bug-report.log.gz (58.4 KB)

ahktenzero · April 16, 2013, 11:58pm

Another lockup. This time there were WAIT errors in the X logs. Bug report output attached.

I’ve installed the 319.12 beta drivers now. I’m getting loads of kernel warnings as in this thread.

nvidia-bug-report.log.gz (58.7 KB)

vacaloca · April 18, 2013, 2:50pm

It might be moot since you mention you tested on Windows, but you might want to see if you get the same lockups with another power supply under Linux. Did you find any drivers that do NOT experience this problem? That might help to figure out where the bug comes from – cryptor mentioned that he did not have the problem with the 295.40 drivers.

ahktenzero · April 18, 2013, 5:32pm

I didn’t go back as far as the 295 series drivers. I was going to try those next but the 319 beta drivers came out so I’m trying those instead.

I upgraded the PSU on my machine last year, in late September or early October, but I was having stability problems before that. Those problems were pretty much exclusively when playing games in Windows though. I bought the new PSU hoping it might improve the situation; I was using an Enermax ELT620AWT which I replaced with an Enermax EGX850EWL (bought on eBay). If the 295 drivers don’t help I’ll try swapping the old PSU back in.

ahktenzero · April 22, 2013, 5:14pm

No lockups during the period from when I installed the 319.12 drivers to when I rebooted Friday night. Had one today after only a few hours of uptime. Bug report is attached. This time it includes nvidia-debugdump output; Debian don’t have a package for that utility for some reason so I didn’t have it installed when I was using the packaged drivers.

I changed my mind and I’ve put the old PSU back in. I’m still using the 319.12 drivers. If I get another hang then I’ll try the 295 drivers.
nvidia-bug-report.log.gz (95 KB)

ahktenzero · April 25, 2013, 11:07pm

So, the good news is I didn’t waste £55 on a dodgy secondhand PSU. The bad news is I got yet another X lockup. Bug report attached as usual. I can’t do much else in terms of hardware substitution as I don’t have any spare boards with x16 PCI-E slots.

Next step is to try the 295 series drivers.
nvidia-bug-report.log.gz (95.8 KB)

ahktenzero · April 26, 2013, 12:45am

I’m running 295.75 now, on a 3.8.5 kernel. I had to apply some patches, edit some files so it was looking for version.h in the right place, and use a newer nvidia-installer binary, but it’s installed and working.

Annoyingly nvidia-settings doesn’t appear to provide memory usage with this driver, but as memory usage hasn’t got much past 25% at the point previous lockups have occurred I’m not sure it’s related.

VadimLinux · April 26, 2013, 4:49pm

ASUS P9X79 Pro
i7-3960x
64GB RAM
Debian wheezy with plain custom 3.8.5 kernel
NVidia driver 319.12 (tried 313.30 and 310.44 with the same bug)
GeForce Titan
PSU: Silverstone ST1500. Manufactured: 02.2012

When I run 3D applications they stop responding: no keyboard input works, but I can still move the mouse cursor. The application continues to run and use CPU. After killing the application, the [migration] kernel thread uses CPU for several dozen seconds, and then I am stuck with Xorg using 100% CPU.
It’s possible to kill Xorg with Alt-SysRq-k. The screen stays the same (with the window manager workspace on it) for a minute or so, then it switches to a text console with garbled fonts and never changes after that, regardless of what I try.
The nvidia module can be safely removed and reinstalled without warnings in the system logs.
But it’s impossible to start X again.

NVRM: RmInitAdapter failed! (0x26:0xffffffff:1170)
NVRM: rm_init_adapter(0) failed

Video card temperature after a long FurMark test (with Wine) never exceeded 81°C.

Attempts to run nvidia-smi fail with error message:

NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

P.S. About half an hour after killing Xorg, the system started to respond again. It’s now possible to switch text consoles, and fonts are garbled only on the 7th console (the former Xorg tty).
startx fails with error messages:

NVRM: rm_init_adapter(0) failed
dmar: DRHD: handling fault status reg 402
dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr 937ffc4000
DMAR:[fault reason 06] PTE Read access is not set
NVRM: RmInitAdapter failed! (0x26:0xffffffff:1170)
NVRM: rm_init_adapter(0) failed

nvidia-bug-report.log.gz (63.9 KB)
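
The dmar/DMAR lines in the output above are IOMMU (DMA remapping) fault reports, and the faulting device 01:00.0 is the same PCI address nvidia-smi gives for the GPU. A rough Python sketch for pulling those fields out of kernel log lines follows; the regular expression is an assumption based only on the single line quoted above, and other kernels may format the message differently.

```python
import re

# Pattern modelled on the one DMAR fault line quoted in this post.
DMAR_FAULT = re.compile(
    r"DMAR:\[(?P<kind>DMA \w+)\] Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+)"
)


def parse_dmar_fault(line: str):
    """Return (access kind, PCI device, fault address) or None if no match."""
    match = DMAR_FAULT.search(line)
    if match is None:
        return None
    return match["kind"], match["device"], int(match["addr"], 16)


line = "dmar: DMAR:[DMA Read] Request device [01:00.0] fault addr 937ffc4000"
kind, device, addr = parse_dmar_fault(line)
print(kind, device, hex(addr))
```

Running this over the full kernel log would show whether every fault comes from the GPU's address or whether other devices are affected too.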

ahktenzero · April 29, 2013, 2:56pm

I stayed in Linux over the weekend and had another lockup this afternoon after 3 days 14:35 of uptime. Exactly the same trigger as usual (opening a new tab with the browser offscreen). It looked like it recovered for a few seconds after the initial lockup but went back to being frozen. With separate X screens the cursor seems to get stuck on the screen it’s on when the lockup occurs; with the 3xx series drivers using XRandR I could still move the cursor between screens after the lockup. Bug report is attached.

I did notice some odd behavior over the weekend. The Flash plugin crashed a few times due to failing to allocate memory during GXLCreatePixMap, taking my browser down with it. When watching fullscreen video in mplayer if a notification appeared on top of it it would stop playback until the notification disappeared. This happened with both GL and VDPAU video output.

I’m going to try the 295.40 drivers. Getting them to compile for the 3.8 kernel might be a problem so I’m going to switch back to 3.2.
nvidia-bug-report.log.gz (108 KB)

ahktenzero · April 29, 2013, 8:22pm

295.40 seems to be worse. X locked up while I was away from my computer for a bit; all of the previous lockups have been while I was using it. There are the usual WAIT and miEQ overflowing messages, but preceding those there are a lot of messages like

[ 14571.676] (II) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
[ 14571.676] (II) NVIDIA(0):     recover...
[ 14571.684] (II) NVIDIA(0): Error recovery was successful.

in the log, which I didn’t get on any of the more recent driver versions. Bug report is attached.
nvidia-bug-report.log.gz (107 KB)
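
Since the mix of messages differs from one lockup to the next, it can help to tally them per log. A minimal Python sketch follows. The error-recovery and nvLock patterns come from lines quoted in this thread; the WAIT and EQ-overflow patterns are assumptions about the exact wording and may need adjusting against a real log.

```python
import re
from collections import Counter

PATTERNS = {
    "wait": re.compile(r"NVIDIA\(\d+\): WAIT\b"),               # assumed wording
    "mieq_overflow": re.compile(r"EQ overflowing", re.IGNORECASE),  # assumed wording
    "error_recovery": re.compile(r"has encountered an error"),
    "nvlock_timeout": re.compile(r"nvLock: client timed out"),
}


def summarize(lines) -> Counter:
    """Count how many log lines match each known message category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Lines quoted earlier in this thread.
sample = [
    "[ 14571.676] (II) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to",
    "[ 14571.676] (II) NVIDIA(0):     recover...",
    "[ 14571.684] (II) NVIDIA(0): Error recovery was successful.",
    "[311109.843] nvLock: client timed out, taking the lock.",
]
print(summarize(sample))
```

To run it on a real log, pass an open file instead of the list, for example `summarize(open("/var/log/Xorg.0.log"))`.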

ahktenzero · April 29, 2013, 10:44pm

I’ve moved my card over to the secondary x16 slot on my motherboard (which runs at a maximum x4 link). Previously when I was using the FX 5200 I had it in this slot as it wouldn’t work in the primary one.

I’m not sure what else to try. Support for the 560 Ti was added in 270.26 so I could try going back that far but I’m not sure it’s worth the effort.

ahktenzero · April 30, 2013, 2:34pm

Moving the card to the other slot hasn’t made any difference; I had another lockup today. Bug report is attached.

I’m going to go back to the other card I was using (which turns out to be a 7500LE, not a 5200 FX) and see if my system will run stably with that.
nvidia-bug-report.log.gz (97.5 KB)

vacaloca · April 30, 2013, 3:23pm

If the other card still has the same issues, it could still be a power problem. You mentioned your first power supply might have been bad (the one you’re using now), and you purchased a second one from the same brand used (the one you were using before)… that doesn’t bring much reliability into the equation. At the very least see if you can get a power supply tester and make sure they’re outputting the right voltages.

If you have another Linux test machine that you can test the 560 Ti in, to see whether the issue is the card itself, that would be ideal. So far you’ve identified it’s not a slot issue, (probably) not a driver issue, (maybe?) not a power issue, but it’s still inconclusive on my end.