NVRM Xid error 59 with Kepler card (CUDA) on 4th PCIe 3.0 port

Background:

I’ve set up a machine for running CUDA code on 4x GTX 680 cards using the Z77 platform (vs. X79, which is not officially supported by NVIDIA at PCIe 3.0 speed). The machine is built around a mobo (Asus P8Z77 WS, latest BIOS 3205) featuring four PCIe 3.0 x16 slots managed by a PLX 8747 switch, which splits the 16 PCIe 3.0 lanes from the CPU into four x8 links. The machine has been running rock stable for weeks with a single GTX 680 card plugged into the first PCIe 3.0 slot. It is powered by a 1350W PSU with the extra Molex plugged into the mobo. The integrated Intel GPU is used for display, and no screen is plugged into the GTX 680 card(s).

Problem:

In short: a PCIe 3.0 card plugged into the 4th slot, and only that one, renders the nvidia driver unstable. It does not happen with PCIe 3.0 cards plugged into slots 1 to 3, nor with PCIe 2.0 cards in slot 4.

Moving to the 4x GTX 680 setup, I started getting these nvidia driver Xid 59 errors within minutes, always reported for the same card. It turns out that all the cards work fine as long as they are plugged into any of the first 3 slots; only the 4th slot is problematic. The error happens after a while just by running ‘nvidia-smi -l 1’, or earlier by loading the card(s) with a CUDA memcheck test. Interestingly, the problem never occurs within hours of testing with PCIe 2.0 cards (GTX 580, Quadro 4000). So it seems that the stability of the nvidia driver is challenged by running a PCIe 3.0 card in this 4th slot, and only that one, which is the most distant from the PLX switch.
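
Roughly, the load/monitoring procedure looks like the sketch below (./cuda_memtest is just a placeholder for whatever CUDA memory test is used; CUDA_VISIBLE_DEVICES pins each instance to one card):

# one CUDA memory test per GPU (placeholder binary; adjust to your own test)
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i ./cuda_memtest > memtest_gpu$i.log 2>&1 &
done

# poll the GPUs and watch the kernel log for Xid messages
nvidia-smi -l 1 > nvidia-smi.log &
watch -n 5 "dmesg | grep 'NVRM: Xid' | tail -n 5"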

Typically, for a single card plugged into the 4th slot one gets:

[ 9.386032] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 310.19 Thu Nov 8 00:52:03 PST 2012
[ 54.368022] nvidia 0000:05:00.0: irq 66 for MSI/MSI-X
[ 54.999320] NVRM: GPU at 0000:05:00: GPU-3c52f841-dcb9-73e0-7609-361293e889d3
[ 348.596567] NVRM: Xid (0000:05:00): 59, 0098(209c) 0400c287 12c93713
[ 393.437650] [sched_delayed] sched: RT throttling activated
[ 686.339589] nvidia 0000:05:00.0: irq 66 for MSI/MSI-X
[ 690.486892] NVRM: RmInitAdapter failed! (0x27:0x38:1077)
[ 690.486906] NVRM: rm_init_adapter(0) failed

and then nvidia-smi reports:

NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
NVIDIA-SMI has failed because it couldn’t communicate with NVIDIA driver. Make sure that latest NVIDIA driver is installed and running.

When the Xid 59 error happens, the system does not necessarily freeze, depending on the kernel and nvidia driver versions; however, the whole system clearly becomes unusable for CUDA computing. I’ve been testing various combinations of kernels/distros (only Ubuntu so far: 8.04, 10.04, and here 12.04 with a 3.5 Linux kernel) and underclocked hardware setups, to no avail. The longest stable test I managed to get was about 45 minutes with all 4 GTX 680 cards loaded with cuda memtests using Ubuntu 8.04 and the 295.59 drivers, but once frozen the machine could not be recovered except with a hard reboot. Given that, and the fact that the PCIe 2.0 cards are able to work for hours, I don’t think this is a hardware failure of the mobo in particular; again, individually all the GTX 680 cards run fine.

Questions:

  1. What does this Xid 59 error really mean? I’ve never seen it reported before.
  2. Why only with the 4th PCIe slot? Could it be that the signal gets too weak, or that the latency increases too much (esp. through the PLX switch), which confuses the nvidia driver?
  3. As a temporary workaround, to be able to use all 4 of my cards and not only 3, is it possible to force PCIe 2.0 on all slots or on a given slot until the nvidia driver fixes the problem?

Thanks!

It’s possible that the fourth slot might just be too far away to negotiate a satisfactory link, yes. I’ll ask our PCIe experts for any input they might have on this.

Can someone please explain in a few words what this Xid 59 error means? Is it a general CPU-GPU-memory communication problem or does it involve a particular subsystem?

I’ve now been struggling for another month trying to solve my stability problem. It actually turns out to be more general than what I described above; it does not involve the 4th PCIe port alone. I’ve tested dozens of configurations, both hardware and software, trying to isolate the cause. But since the machine can be stable for days (up to 2 solid days under heavy cuda memtests on all 4 cards in parallel) before the Xid 59 error pops up, it is getting utterly tedious and depressing. My $3000 pocket-money build is rotting away, with absolutely no use for my academic research except generating headaches and lost nights.

I’ve reported this stability problem on Asus’ forums, with no answer so far. Can I please get a hint about what this very unusual Xid 59 error is all about, so as to narrow down the path to the root cause?

Thank you and merry Xmas!

Some news: I’ve now been running stable for ~5 days with 3 of my 4 GTX 680 cards each stressed independently with a cuda memtest run. The idle card is in the 3rd slot; according to the lspci output, this 3rd slot is actually the first one to be enumerated:

% sudo lspci -tv
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller
           +-01.0-[01-06]----00.0-[02-06]--+-08.0-[05]--+-00.0  NVIDIA Corporation GK104 [GeForce GTX 680]
           |                               |            -00.1  NVIDIA Corporation GK104 HDMI Audio Controller
           |                               +-09.0-[06]--+-00.0  NVIDIA Corporation GK104 [GeForce GTX 680]
           |                               |            -00.1  NVIDIA Corporation GK104 HDMI Audio Controller
           |                               +-10.0-[03]--+-00.0  NVIDIA Corporation GK104 [GeForce GTX 680]
           |                               |            -00.1  NVIDIA Corporation GK104 HDMI Audio Controller
           |                               -11.0-[04]--+-00.0  NVIDIA Corporation GK104 [GeForce GTX 680]
           |                                            -00.1  NVIDIA Corporation GK104 HDMI Audio Controller

and nvidia-smi reports stats using the same order (i.e. it lists the cards in slots 3, 4, 1, 2), except when one passes pci=bfsort to the kernel at boot (nvidia-smi then lists the cards in slots 1, 2, 3, 4). Note that the “NVRM: Xid (0000:05:00): 59” message reported above therefore points to the device in the third slot, i.e. the first one enumerated, and for the last month I have always gotten that very same error.
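
For the record, pci=bfsort is passed via GRUB on Ubuntu roughly as in the following sketch (adjust for your own boot loader):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=bfsort"

# regenerate the boot config and reboot
sudo update-grub
sudo reboot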

Thus, at the moment the error seems to occur on the first card only (here in the 3rd PCIe slot; note that the Intel iGPU is no longer used as the primary VGA display in the BIOS). Interestingly, lspci reports various correctable PCIe errors, but so far there is no clear correlation with the problem reported here.
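
The correctable errors show up in the PCIe status bits; a quick way to check them per card is something like the sketch below (05:00.0 is the first enumerated GTX 680 from the lspci tree above; the exact fields shown depend on the lspci version):

# dump the PCIe device/link/AER status bits for the card on bus 05
sudo lspci -vvv -s 05:00.0 | grep -E 'DevSta|LnkSta|CESta|UESta'
# DevSta showing CorrErr+ means correctable errors were logged;
# CESta lists which ones (e.g. BadTLP, BadDLLP) when AER is exposed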

Someone else is now also reporting this random NVRM Xid 59 error (see the bug report below) using another Asus Z77 board:
https://devtalk.nvidia.com/default/topic/526044/linux/x-server-1-13-1-deadlocks-randomly-on-geforce-gtx680/

That Xid error is pretty generic and basically just means that the driver lost communication with the GPU or there was an unrecoverable error on the GPU itself. There are indeed known stability problems with PCIe 3.0 on many motherboards, so it’s not too surprising that you’re seeing problems.

Does your system BIOS have an option to disable PCIe 3.0? That’s the best option if it’s available. I believe a future driver release will have an option to try to force PCIe 2.0, but a BIOS setting would be more reliable.
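
If the BIOS option is used, whether a slot has actually trained down to PCIe 2.0 can be double-checked from the lspci link status, roughly as in this sketch (8GT/s corresponds to PCIe 3.0 and 5GT/s to PCIe 2.0; 05:00.0 is just the example card from the tree above):

# LnkCap = maximum link supported, LnkSta = currently negotiated link
sudo lspci -vv -s 05:00.0 | grep -E 'LnkCap:|LnkSta:'
# e.g. "LnkSta: Speed 5GT/s, Width x8" would confirm the slot fell back to PCIe 2.0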

It seems this driver bug is all over the forums, but I am posting here because I believe my application is very similar to that of ncomp. I am running CUDA analysis for a scientific application on several machines (23 GPUs in total). Two of the machines are identical in terms of hardware; in particular, I am using the Asus P6T7 with 4 GTX 570s. On one machine, using the 295.20 driver, everything has been fine under heavy load for about two years. The other machine, using various 3xx drivers, is constantly hanging, with various NVRM Xid errors, or RT throttling, or “os_schedule: Attempted to yield the CPU while in atomic or interrupt context” errors. I have verified that the problem is not a result of hardware, and not due to running multiple GPUs.

Since the 295.20 driver was working fine, I would VERY MUCH like to roll back to it. Unfortunately, I cannot seem to install it properly: I run into errors finding the kernel header file “linux/version.h”, and when I supply the header file, the install script insists on a 2.6 Linux kernel. I am running 3.8.0-19 and would much prefer NOT to revert to 2.6! Somehow the machine that is running properly has kernel 3.1.10-1.9, so I am not sure how I managed to install NVIDIA driver 295.20 on that machine. Please help me, either by solving the apparently rampant bug in the newer drivers, or by helping me revert to 295.20. Thanks!
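
For reference, the usual way to point the .run installer at matching kernel headers is roughly the sketch below (a Debian/Ubuntu-style headers package is assumed; whether the old 295.20 installer will then actually build against a 3.x kernel is a separate question):

# install the headers matching the running kernel
sudo apt-get install linux-headers-$(uname -r)

# tell the NVIDIA .run installer where those headers live
sudo sh NVIDIA-Linux-x86_64-295.20.run --kernel-source-path=/usr/src/linux-headers-$(uname -r)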

To be clear, I am encountering fatal problems with the following configurations (all x86_64 architecture):

NVIDIA Driver 304.54, kernel 3.8.0-19, 1 GTX770 in an Asus P9X79-E
NVIDIA Driver 319.32, kernel 3.8.0-19, 1 GTX770 in an Asus P9X79-E
NVIDIA Driver 304.54, kernel 3.8.0-19, 2 GTX770 in an Asus P6T7
NVIDIA Driver 319.21, kernel 3.7.10-1.11, 2 GTX770 in an Asus P6T7
NVIDIA Driver 304.54, kernel 3.7.10-1.11, 2 GTX770 in an Asus P6T7
NVIDIA Driver 319.21, kernel 3.7.10-1.11, 1 GTX570 in an Asus P6T7
NVIDIA Driver 304.54, kernel 3.7.10-1.11, 1 GTX570 in an Asus P6T7 (both this and the previous configuration were also tested with 2, 3, and 4 GTX570 cards)

Errors occur on Ubuntu, Mint and openSUSE. The errors are different, as described in the previous post, but they manifest in the same way: after the analysis has run for a couple of days, the process using the GPU hangs and cannot be killed (it becomes defunct if killed, but continues using CPU). Eventually the machine hangs (no VGA output, cannot ssh). This happens whether the machines are booted headless or not, and whether I start X or not. The only way to restore functionality is a power cycle; a soft restart is not sufficient. In all cases I can rule out everything but the driver or the kernel, and I strongly suspect the driver. All configurations with NVIDIA driver 295.20 or 295.41 are stable for months. I cannot manage to install the 295 driver because of problems with the installer script, apparently related to the newer kernels.