Kepler Titans throw RmInitAdapter Failed using Threadripper and are unusable

Hi,

I’ve been having a strange problem with my new Threadripper 1950X system running Ubuntu 17.04, with kernel 4.10.0-35 and driver 384.81 (although I’ve also tried earlier drivers and had the same issue). In short, I can’t get any Kepler-era GPUs to function. Instead, they fail during bootup with the message “NVRM: RmInitAdapter failed!”.

Currently, this system has a GTX 1050 to drive monitors, a Titan Xp, and a Kepler Titan. If the Kepler Titan is removed, everything works well. If only the Kepler Titan is installed I get these same error messages. I have reproduced this problem with multiple Kepler Titans and verified that they work on another system, so it seems that my computer is the issue.

I’m at a loss to fix this problem and I was hoping that someone else might have some insight.

nvidia-bug-report.log.gz (358 KB)

Looks like you’re using the CUDA-bundled driver. Can you check whether the ‘normal’ 384.90 or the new beta 387.12 driver works?

Hi,

Thanks for the suggestion. Unfortunately, both of those drivers fail as well, in what seems to be the same way. If I run with just the Pascal cards then everything is fine, but if the Kepler card is installed I get the same failure mode. I’m attaching those bug reports as well, if it’s helpful.

Best,
Eric
nvidia-bug-report.log.384.90.gz (344 KB)
nvidia-bug-report.log.387.12.gz (343 KB)

Then maybe check whether

  • no overclocking is set in the BIOS
  • the kernel parameters iommu=off amd_iommu=off help
  • the Kepler works in the slot where the Titan Xp is currently seated.
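
When testing kernel parameters like these, it is worth confirming that they actually reached the running kernel. A minimal Python sketch of that check follows; the helper name `missing_kernel_params` is made up for illustration, it only looks for exact tokens, and by default it reads /proc/cmdline (Linux only):

```python
from pathlib import Path


def missing_kernel_params(wanted, cmdline=None):
    """Return the entries of `wanted` that are absent from the kernel command line.

    `cmdline` defaults to the contents of /proc/cmdline; pass a string
    explicitly to check a saved command line instead.
    """
    if cmdline is None:
        cmdline = Path("/proc/cmdline").read_text()
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


# Example with an explicit (made-up) command line:
example = "BOOT_IMAGE=/vmlinuz ro quiet iommu=off"
print(missing_kernel_params(["iommu=off", "amd_iommu=off"], example))
# prints ['amd_iommu=off']
```

Calling `missing_kernel_params(["iommu=off", "amd_iommu=off"])` without the second argument checks the live system.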

Thanks for working with me on this. I’ve gone back to the 384.81 driver, but I can swap to the newer one or the beta if that would help.

-There’s no overclocking in the BIOS
-I tried iommu=off amd_iommu=off and there was no change in behavior, still the same problem
-I tried putting the Kepler in the slot used by the Titan Xp and there was no change in behavior, still the same problem (I took another bug report for this configuration with the above kernel parameters, if it’s helpful)

Best,
Eric
nvidia-bug-report.log.384.81.nommu.gz (267 KB)

Now that looks even worse. On this bus it’s failing even before X starts, on init of the audio function. That points more to a general hardware incompatibility.
You’re running a rather old BIOS; ASRock has released several updates in quick succession, so there may have been some problems there. If that doesn’t help, try the latest kernel. Beyond that, the only last resort is throttling the bus from gen3 to gen2, if the BIOS allows it, and hoping that works. In any case, you should contact ASRock support about the issue.

I hadn’t realized that there was a new BIOS available. I’ve upgraded to the 1.70 BIOS and so far it doesn’t seem to have helped. As a note that might be helpful to others in my predicament, it is necessary to set the “promontory pcie bridge” to “gen2” or else you will get all sorts of TLP and DLLP errors. I also tried setting the IOMMU to off in the BIOS, but that doesn’t seem to help. I’m about to try the 4.13.5 kernel and I’ll report back with the results.
nvidia-bug-report.log.newBios.gz (340 KB)

Looks like 4.13.5 didn’t do any better. I guess that my next step is to try out Windows and see if the hardware works there and then contact ASRock. Thanks for all of your help!

Amazingly, I think that I may have succeeded. I switched the PCI switch setting to GEN2 and used the “pci=nommconf” kernel parameter and now everything seems to work!
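
For anyone repeating this, it may help to verify that the link really negotiated at gen2 after changing the BIOS setting. The following is a rough Python sketch under some assumptions: newer kernels expose a `current_link_speed` attribute per device under /sys/bus/pci/devices (older kernels may not have it, in which case the LnkSta line of `lspci -vv` shows the same information), and the exact text format of that attribute varies between kernel versions. The function names are made up for illustration.

```python
import re
from pathlib import Path

# Per-lane signalling rates of each PCIe generation, in GT/s.
_GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}


def pcie_generation(speed_text):
    """Map text such as '5 GT/s' or '8.0 GT/s PCIe' to a generation number.

    Returns None when the text cannot be interpreted.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*GT/s", speed_text)
    if match is None:
        return None
    return _GEN_BY_RATE.get(float(match.group(1)))


def report_link_speeds(root="/sys/bus/pci/devices"):
    """Print the negotiated link speed of every PCI device that exposes it."""
    for dev in sorted(Path(root).glob("*")):
        attr = dev / "current_link_speed"
        if not attr.is_file():
            continue
        try:
            text = attr.read_text().strip()
        except OSError:
            continue  # some devices refuse the read; skip them
        print(dev.name, text, "-> gen", pcie_generation(text))
```

Running `report_link_speeds()` on the affected machine should list gen 2 for the slots behind the bridges that were set to GEN2, assuming the setting took effect.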

Hi! I am facing the same problem here: GTX Titan + Threadripper 1950X, but I am using an MSI MOBO. So basically you downgraded PCIe from 3.0 to 2.0 and it works, is that right? Will that affect the card’s performance?

Thanks!

I think PCIe is a bit problematic in general with the X399 chipset, so it’s not vendor specific.
I remember a user reporting a 15% performance loss when switching from gen3 to gen2 in a different context.

Yes, downgrading both the normal and the promontory PCIe bridge to 2.0 seems to have done the trick. I haven’t done any pre/post performance tests, but in my case I don’t expect it to make much difference, as my workload is not limited by transfer speed from the host to the device.
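
For a rough sense of what the downgrade costs on paper, here is a small Python sketch of the theoretical link bandwidth. It assumes an x16 link and counts only the line encoding (8b/10b for gen2, 128b/130b for gen3), so it ignores packet overhead and says nothing about measured application performance.

```python
# (transfer rate in GT/s per lane, encoding efficiency) per PCIe generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gb_per_s(generation, lanes=16):
    """Theoretical one-direction bandwidth in GB/s after line encoding."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes / 8  # bits -> bytes


for gen in (2, 3):
    print(f"gen{gen} x16: {link_bandwidth_gb_per_s(gen):.2f} GB/s")
# prints:
# gen2 x16: 8.00 GB/s
# gen3 x16: 15.75 GB/s
```

So the host-to-device ceiling is roughly halved, which only matters for workloads that are actually bound by transfers.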

It’s surprising that this problem exists. I’m curious whether it’s just an issue under Linux or if it happens under Windows as well?

Many TR customers were surprised as well. As far as I know, PCIe passthrough/IOMMU in particular is flaky, so people who use the machine as, e.g., a virtualization host are hit. AMD knows about it and is looking into it. It’s hard to tell whether Windows is affected too, since you would need Windows Server 2016 for those scenarios, and who would do that?

Ok, thanks!

I was wondering if anyone has tried other Kepler cards on the platform? Like the Titan Black or Titan Z?
The recent BIOS update logs for my MSI MOBO (X399 Carbon whatever…) suggest that support for the Titan Z has been added. But I don’t know whether that covers only the Titan Z or the whole Kepler series.

Oh, someone has notified AMD? Great! Do you have any reports of similar issues on MOBOs from other vendors? The MSI one is giving me a hard time on Linux (Ubuntu 16.04), and their customer service apparently has no idea what a computer looks like!

Yes, I am aware of the slower transfer speed of PCIe 2.0, but I mostly use the GPU for computation and transferring data is not a bottleneck for me, so I will take that route.

As I said, it seems the X399 chipset is affected in general, so all TR boards from all vendors. Just search for “threadripper pcie passthrough” and you will stumble upon a lot of threads. In many of those posts, people mention that apart from PCIe passthrough, several PCIe cards fail to function.
AMD has even set up a customer survey site for PCIe passthrough.