Tesla K40 and Linux Kernel Interrupt problem

Hi, all. I’ll try to keep this short, but I’m often not very good at that.

I’ve got a set of 32 Dell PE R730 servers, each with 2 Tesla K80s, that are working great with NVIDIA driver 361.28. I’m trying to add a single K40m to a couple of other Dell PE R730 servers, which didn’t have GPUs previously, using the same OS image and driver, and I’m getting this output when I run “nvidia-smi”:

nvidia-smi

Unable to determine the device handle for GPU 0000:03:00.0: The NVIDIA kernel module detected an issue with GPU interrupts. Consult the “Common Problems” chapter of the NVIDIA Driver README for details and steps that can be taken to resolve this issue.

In this state, either after a period of time, or if I try to use the GPU (e.g. through CUDA), I get a kernel panic. Using something like “sosreport” triggers the kernel panic every time.

We do the NVIDIA driver install (e.g. “NVIDIA-Linux-x86_64-361.28.run --silent”) during the first boot after a re-image of a server. Just today I discovered that if I reboot (but don’t reinstall) the server, it seems to recover, and I get the expected output from “nvidia-smi” after the clean reboot. This has been consistent behavior over several instances of reinstall followed by a clean reboot, and with both the 358.13 and 361.28 drivers.

I’ve also tried simply doing a “modprobe -r nvidia; modprobe nvidia” to see if that cleans it up, but it didn’t seem to do anything.
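
In case it helps anyone trying to reproduce this, a fuller unload/reload sketch would look something like the following (the module names are assumptions for this driver generation; check “lsmod” for what is actually loaded, and stop anything holding the device open first, such as nvidia-persistenced or X):

# see which nvidia modules are actually loaded
lsmod | grep nvidia
# unload dependent modules first, then the core module (names assumed; adjust to lsmod output)
modprobe -r nvidia_uvm nvidia_modeset nvidia
modprobe nvidia
nvidia-smi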

If needed, I can script something that selectively reboots after install if the host has a K40 in it. I was just hoping someone could shed some more light on what might be going on, and on whether there’s a cleaner way to fix this. I’m having trouble understanding what the problem is, let alone why it affects the 1xK40 node but not the 2xK80 nodes.
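
If I do end up scripting that selective reboot, it would probably be something along these lines (a rough, untested sketch; the lspci vendor ID filter and the “K40” match string are assumptions based on how the card identifies itself):

# 10de is the NVIDIA PCI vendor ID; reboot once after the driver install
# only if a K40 is actually present in this host
if lspci -d 10de: | grep -qi 'K40'; then
    echo "K40 detected, rebooting once after driver install"
    reboot
fi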

Any thoughts here?

Thanks,
Lloyd
nvidia-bug-report.log.gz (61.6 KB)

As an update, I still don’t have a solution, and it turns out that my attempted workaround of cleanly rebooting isn’t consistent enough. Sometimes it works, and sometimes it doesn’t. Also, sometimes the server will go from a failing state to a working state, or from working to failing, without any clear reason why.

I’ve tested this on two separate servers and two separate Tesla K40m cards, with the same symptoms. This argues strongly for a problem with my software image rather than with the hardware.

The one further clue I’ve found is that messages like these show up in /var/log/messages at about the same time as the failing “nvidia-smi” output:

Jul 20 09:36:14 m8int02 kernel: do_IRQ: 0.131 No irq handler for vector (irq -1)
Jul 20 09:36:14 m8int02 kernel: NVRM: RmInitAdapter failed! (0x12:0x45:1937)
Jul 20 09:36:14 m8int02 kernel: NVRM: rm_init_adapter failed for device bearing minor number 0
Jul 20 09:36:28 m8int02 kernel: do_IRQ: 0.131 No irq handler for vector (irq -1)
Jul 20 09:36:32 m8int02 kernel: do_IRQ: 0.131 No irq handler for vector (irq -1)
Jul 20 09:36:32 m8int02 kernel: NVRM: RmInitAdapter failed! (0x12:0x45:1937)
Jul 20 09:36:32 m8int02 kernel: NVRM: rm_init_adapter failed for device bearing minor number 0
Jul 20 09:36:32 m8int02 kernel: do_IRQ: 0.131 No irq handler for vector (irq -1)
Jul 20 09:36:36 m8int02 kernel: do_IRQ: 0.131 No irq handler for vector (irq -1)
Jul 20 09:36:36 m8int02 kernel: NVRM: RmInitAdapter failed! (0x12:0x45:1937)
Jul 20 09:36:36 m8int02 kernel: NVRM: rm_init_adapter failed for device bearing minor number 0

Also, I tried to upload an “nvidia-bug-report.log.gz” back on the 14th, but it still seems to be stuck in a “[SCANNING… PLEASE WAIT]” state when I’m logged in, and it’s invisible when I’m not logged in. I’m not sure what I can do to push that through, but if you need any information from it, I can try posting it directly.

Lloyd

This is preliminary, but I believe the problem is related to the Linux kernel in some way. I’ve upgraded from RHEL’s 2.6.32-504.16.2.el6.x86_64 to RHEL’s 2.6.32-642.3.1.el6.x86_64, and so far I’ve been unable to trigger the problem.

It’s still preliminary, but I thought I’d note it here for posterity. Hopefully someone else won’t have to fight through this like I did.

For further reference, it turns out that this was not a complete solution. The problem continued to happen, just intermittently: I experienced random-seeming failures that I was never able to correlate with anything.

I did determine that if I provisioned the host with the earlier kernel and then used “rpm” from the command line to upgrade to the newer kernel, it seemed to work. I haven’t had an opportunity to continue diagnosing this, since other projects took precedence.
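
For clarity, the sequence that seemed to work was provisioning with the old kernel and then installing the newer kernel package by hand, roughly like this (the exact package file name is an assumption):

# install the newer kernel alongside the old one (-i rather than -U,
# so the previous kernel remains available as a fallback boot entry)
rpm -ivh kernel-2.6.32-642.3.1.el6.x86_64.rpm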

It is worth noting, however, that a somewhat similar issue is occurring with 2.6.32-642.6.2 and Tesla K80s. I’ve got SRs open with both Red Hat and NVIDIA support, and will try to do a better job of updating this thread when I have an answer.

This is slightly embarrassing.

The probably-related issue with K80s and the 2.6.32-642.6.2 kernel seems to come down to my failure to properly blacklist the nouveau driver. Somehow, as of the 2.6.32-642 series from RH, there’s a difference between the nouveau driver loading and subsequently being unloaded (e.g. “modprobe -r nouveau”) and never having been loaded at all. That didn’t seem to be the case with prior kernels.

In our case, we put a “blacklist nouveau” line in /etc/modprobe.d/blacklist-nouveau.conf and append “rdblacklist=nouveau” to the kernel command line. Instead of the command-line parameter, you could also just re-run dracut to regenerate the initramfs so it picks up the blacklist.
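
Concretely, that amounts to something like the following on RHEL 6 (a sketch of our setup; the paths are the stock ones):

# keep nouveau from binding to the GPU once the root filesystem is up
echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf
# then either add rdblacklist=nouveau to the kernel line in /boot/grub/grub.conf,
# or rebuild the initramfs so the blacklist is honored at early boot:
dracut --force /boot/initramfs-$(uname -r).img $(uname -r)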

I have yet to test this on the K40 nodes, but I hope to do that in the next several days.

Just restart the machine.