Instability problems with Xid error 16 or GPU has fallen off the bus (Arch, GTX970) [SOLVED]

Hi,

I have this intermittent instability problem. Sometimes the GPU will crash with Xid error 16 in the logs (see the attached nvidia bug reports).

The hang happens quite intermittently and is a bit difficult to reproduce. If I do not run graphically demanding applications - or only lightly demanding ones - it might happen only once every 240h of usage (i.e. maybe once a month?). But with XCOM: Enemy Unknown (which is not that demanding, IMO) it happens more frequently, say once every hour.

While trying to rule out defective hardware I have:

  1. set the fan to a constant, high speed (50%, which is a much higher RPM than it would normally ramp up to in any scenario) and
  2. underclocked the GPU with a -100MHz offset

This is why I currently have “Coolbits” enabled in xorg.conf, in case you are wondering. I have not overclocked the GPU, and the hangs happened before I enabled it - or, to put it more correctly, I only enabled it while trying to rule out thermal / HW issues.

With underclocking (-100MHz) a different error is produced. With the stock clock, I first get Xid error 16 in xorg.log, but with the slight underclock, I get “GPU has fallen off the bus”.

However, the end result (from the user's perspective) is the same in both cases: the GPU will hang, with a black screen (sometimes a dark hue - blue or purple - for 2-10 seconds), along with X.org and all processes under X.org (also, no switching to VCs anymore, though SysRq still works). After a while (less than a minute) my TV will say “not connected”. I can log in via SSH and usually (95% of the time) shut down the system gracefully (the shutdown takes ages since the system waits for the user processes under X.org to stop, but they are in a broken state), but even the shutdown is not 100% reliable. If I let the system run after the GPU has hung, I believe the whole system (kernel?) will hang after perhaps 10-120 minutes, after which logging in by SSH is no longer possible.

I do not think I have ruled out defective hardware, but I think it was more stable with an older version of the driver (I don’t know which, since I game only intermittently, and if I do not strain the card, it is quite stable). I believe the card is still under warranty, so if you have any tips on how to determine whether this might be faulty hardware after all, that would be appreciated, too!

Forgot to tell my system details:

  • Arch Linux, same issue with several different kernels (I have tried stock=4.7.4, zen=4.7.3 and lts=4.4.21 branches in Arch)
  • EVGA GTX 970 (04G-P4-3975-KR)
  • ASUS Maximus VII Gene + i7-4790k + 16GB RAM
  • Nvidia 370.28 - but at least the previous version was affected, too!

EDIT: some minor wording edits, also made the experienced behaviour description more precise

Attached some nvidia-bug-report.sh outputs:

  1. The one with "normalclock-hang" -prefix is with stock settings. Xid error 16, along with other errors...
  2. "underclocked-fallofbus" is with the -100MHz underclock. (GPU falls off the bus)
  3. "ok" - output from a run of the script while the system seems to be running normally (before the crash).
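
For anyone wanting to compare such reports quickly, here is a minimal Python sketch that counts Xid codes and "fallen off the bus" messages in a saved log. The function name `summarize_gpu_errors` and the sample lines are hypothetical (the samples are not taken from the attached reports), and the regular expression assumes the usual `NVRM: Xid (...): <code>,` shape of the kernel messages, so adjust it if your driver version prints something different.

```python
import re
from collections import Counter

# Matches kernel-log lines shaped like "NVRM: Xid (PCI:0000:01:00): 16, ...".
# The "PCI:" prefix is treated as optional (assumption: some driver
# versions print the bus address without it).
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+),")
BUS_RE = re.compile(r"fallen off the bus", re.IGNORECASE)


def summarize_gpu_errors(lines):
    """Return (Counter of Xid codes, number of 'fallen off the bus' lines)."""
    xids = Counter()
    bus_drops = 0
    for line in lines:
        match = XID_RE.search(line)
        if match:
            xids[int(match.group(2))] += 1
        if BUS_RE.search(line):
            bus_drops += 1
    return xids, bus_drops


# Hypothetical sample lines for illustration only.
SAMPLE_LOG = [
    "[ 1234.5] NVRM: Xid (PCI:0000:01:00): 16, Head 00000000 Count 00000001",
    "[ 1240.1] NVRM: Xid (PCI:0000:01:00): 16, Head 00000001 Count 00000002",
    "[ 2000.0] NVRM: GPU at 0000:01:00.0 has fallen off the bus.",
    "[ 2001.0] usb 1-1: new high-speed USB device",
]

if __name__ == "__main__":
    xids, bus_drops = summarize_gpu_errors(SAMPLE_LOG)
    print("Xid counts:", dict(xids))
    print("Bus drops:", bus_drops)
```

To use it on a real report, pass the lines of the extracted log file instead of `SAMPLE_LOG`, e.g. `summarize_gpu_errors(open(path, errors="replace"))`, where `path` is whatever file you saved.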

Faulty power supply?

generix: That’s a good guess. The problem here is that it could be any of the following:

  1. Faulty MB
  2. Faulty RAM
  3. Faulty PS
  4. Faulty graphics card
  5. A bug in BIOS / BIOS Linux compatibility
  6. A bug in NVidia driver

I have already ruled out the graphics card being faulty - I loaned it to a friend (who already had a GTX970) and it has been running without a hitch for a few days now.

I also know now, that the system runs fine with a lesser graphics card (GTX 660) - but that might not stress the PCIe bus enough (to make a faulty MB actually fail).

My bet is currently on a faulty MB. I’m going to do some more testing when I get the graphics card back from the diagnostics session at my friend’s PC =). I’m also hoping it could be some BIOS setting that does not work well with Linux (I’m just shooting with a shotgun here - perhaps ASPM?).

FWIW I have sensor data for over a year, which also shows the voltages. +5V, +12V, +3.3V CPU Vcc, AVCC and Vbat are as stable as they can be (I also have core voltages logged, but I’m not sure what their range should be - as they go up and down according to P-states - and if they were wrong, then the problem would be in the MB in any case).
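
As a rough illustration of what "stable" can mean for the main rails, the following Python sketch checks logged samples against the ±5% tolerance that the ATX specification allows on +3.3V, +5V and +12V. The function name `check_rail` and the sample readings are hypothetical and are not taken from the sensor logs mentioned above.

```python
# Nominal voltages of the main positive rails; the ATX specification
# allows +/-5% on each of them.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
TOLERANCE = 0.05


def check_rail(name, samples):
    """Summarize logged samples for one rail and flag out-of-tolerance data."""
    nominal = NOMINAL_RAILS[name]
    lowest, highest = min(samples), max(samples)
    # Largest relative deviation from nominal seen in the samples.
    worst = max(abs(lowest - nominal), abs(highest - nominal)) / nominal
    return {
        "min": lowest,
        "max": highest,
        "worst_deviation": worst,
        "within_tolerance": worst <= TOLERANCE,
    }


if __name__ == "__main__":
    # Hypothetical readings, e.g. one sample every 30 seconds.
    readings = {"+12V": [12.10, 12.05, 11.95], "+5V": [5.02, 5.04, 5.01]}
    for rail, samples in readings.items():
        print(rail, check_rail(rail, samples))
```

Keep in mind that motherboard sensors sampled every few seconds may not catch very short voltage dips under sudden load, so a clean log like this makes a PSU fault less likely but probably cannot exclude it entirely.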

I’m just hoping it is something I can fix, since although the MB has warranty, it will take a while to replace…

(p.s. did you notice the other user with the exact same graphics card and a very similar problem?)

Of course all just a guess but I think this could be related to power. When you underclocked you got the fallen off the bus message, so maybe the card could cleanly remove itself instead of hard crashing because of lower power consumption. Maybe rule this out by limiting the max GPU/memory frequencies thus limiting max power consumption.
It gets power from the supply through the PCIe and through the extra connectors. If you’re mentioning the other thread about the same card (didn’t find it) maybe it’s using power from the PCIe just above the specs which your MB doesn’t provide? So kind of an incompatibility.
Still, another power supply is the easiest to try.

It’s been a while, but I’ve since determined this has (had) to have been a HW issue after all - and decided to report here (in case someone has similar issues).

I specifically froze my system updates (Kernel/NVidia drivers) for a while so that there would be fewer variables (during my GTX660 swap). After I re-installed the GTX970 … the stability issues are gone!

So essentially, all I have done is 1) re-seated the graphics card and 2) re-seated the RAM on the MB.

This definitely rules out any software (driver) issue.

There is still the possibility of a hairline fracture on the MB, or broken power circuitry on the MB (whatever it is, it malfunctions only intermittently). I kind of hope that whatever it is, it will break down properly and utterly before the warranties run out =D (my MB, PSU and graphics card are all still under warranty)

So marking as solved!

p.s. Thanks generix for your suggestions, though. As I said, power issues were my first bet… however I’d presume that if it was the PSU, there should be at least some voltage fluctuations (there are none, and I log my voltages every 30 seconds and have monitored them in real time during specific stress tests). I’d find it more likely that the power circuitry on the MB is flaky. Now I have my fingers crossed that the system will either stay stable (as it is now) or break down properly, so that it is easier to find the real culprit. This is the problem with intermittent problems - they can be a real PITA to diagnose…