Second GPU board (Tesla C2075) doesn't appear in the NVIDIA X Server Settings!

WilsonPardiJunior · August 19, 2016, 9:30am

Dear forum members,

I’m trying to set up and use a computer system with the following hardware and software:

Operating System: CentOS Linux 7 (kernel 3.10.0-327.18.2.el7.x86_64)
Motherboard: X99-A (BIOS version 3101)
RAM: 32 GB
Primary graphics board: NVIDIA GeForce GT 610
Secondary graphics board: NVIDIA Tesla C2075
NVIDIA driver version 361.42

I’m able to start the system in graphics mode using the GeForce GT 610 as the default graphics board without any problems.

However, when I open NVIDIA X Server Settings, it shows only the first GPU board (GPU-0), i.e., the GeForce GT 610.
It doesn’t show the second GPU board (GPU-1, the Tesla C2075)!

The following error messages appear in the Xorg.0.log:

[drm] Failed to open DRM device for (null): -22
(EE) NVIDIA (GPU-1): Failed to initialize the NVIDIA GPU at PCI: 2:0:0
(EE) NVIDIA (GPU-1): Please check your system’s kernel log for additional error messages
(EE) NVIDIA (GPU-1): Failed to initialize the NVIDIA graphics device!

Running dmesg | grep NVRM gives:

NVRM: failed to copy vbios to system memory
NVRM: RmInitAdapter failed! (0x30:0xffff:657)
NVRM: rm-init-adapter failed for device bearing minor number 0

Also, running dmesg | grep error gives:

ioapic: probe of 0000:00:05.4 failed with error -22

Additionally, invoking nvidia-smi from the command prompt shows:

Unable to determine the device handle for gpu 0000:02:00:0: Insufficient Permissions

And invoking nvidia-debugdump -l shows:

No permission to talk to device 0000:02:00:0
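
A minimal sketch of how the two dmesg filters above could be combined into one pass, assuming a POSIX shell. The function name gpu_kernel_errors is only illustrative, and the patterns are just the ones that happened to match on this system:

```shell
# Keep only kernel-log lines that mention NVRM or a failed device probe.
# Reads log text on stdin, so it works on live or saved dmesg output.
gpu_kernel_errors() {
    grep -E 'NVRM:|failed with error'
}
```

It would be used as dmesg | gpu_kernel_errors, or on a saved log with gpu_kernel_errors < saved-dmesg.txt (the file name is a placeholder).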

It seems to me that something is restricting access to the second GPU board (Tesla C2075), but I confess that I have no idea what is causing this problem…

My goal is to use the first GPU board (GeForce GT 610) for graphics only and the second GPU board (Tesla C2075) only for computation (using CUDA C).
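
Once the driver can initialize both boards, one way to get this split is to hide the GeForce from CUDA programs through the CUDA_VISIBLE_DEVICES environment variable. The sketch below is only an illustration under stated assumptions: it assumes nvidia-smi lists both GPUs, that the installed CUDA version accepts GPU UUIDs in CUDA_VISIBLE_DEVICES, and that the board name contains "Tesla". The function name tesla_uuid is hypothetical.

```shell
# Print the UUID of the first GPU whose name contains "tesla" (case-insensitive).
# Expects "uuid, name" lines on stdin, the layout produced by
#   nvidia-smi --query-gpu=uuid,name --format=csv,noheader
tesla_uuid() {
    awk -F', ' 'tolower($2) ~ /tesla/ { print $1; exit }'
}

# Example use (commented out because it needs a working driver):
# export CUDA_VISIBLE_DEVICES="$(nvidia-smi --query-gpu=uuid,name --format=csv,noheader | tesla_uuid)"
```

Selecting by UUID rather than by index avoids depending on how CUDA and nvidia-smi order the devices, which are not guaranteed to agree.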

I would be extremely thankful to anyone who can explain how to fix this problem.

Best Regards from Japan,

Wilson Pardi Junior

Robert_Crovella · August 19, 2016, 2:46pm

Are you using the latest BIOS for your motherboard?

Do you have all necessary aux power connections properly made to the C2075?

WilsonPardiJunior · August 25, 2016, 8:21am

Hi txbob,

Thank you for looking into the problem I described above. I really appreciate it!

Regarding your questions:

“Are you using the latest BIOS for your motherboard?”

Actually, I wasn’t!

I updated the motherboard (X99-A) BIOS this week to its most recent version, 3301 (released on June 28th).

However, the BIOS update hasn’t solved my problem… :-(

“Do you have all necessary aux power connections properly made to the C2075?”

Yes, I think so.

The Tesla C2075 board is being powered by two PCI-Express power cables: one 8-pin and one 6-pin.
(BTW, the computer system has an 850 W power supply.)

Besides, if the Tesla C2075 board were not properly connected to the power supply, shouldn’t the error message be “Unable to determine the device handle for gpu 0000:02:00:0: Unable to communicate with GPU because it is insufficiently powered”?

I’m still trying to figure out how to fix this problem…

Best Regards from Japan,

Wilson Pardi Junior

Robert_Crovella · August 25, 2016, 2:12pm

Have you properly removed the nouveau driver?

WilsonPardiJunior · August 29, 2016, 4:33am

Hi txbob,

As far as I know, the nouveau driver isn’t being loaded, or at least isn’t being used by the NVIDIA boards.

Some results:

lsmod | grep nouveau returns nothing.

lsmod | grep nvidia returns:

nvidia_modeset 742374 3
nvidia 10026452 59 nvidia_modeset
drm 349210 3 nvidia
i2c_core 40582 3 drm,i2c_i801,nvidia

lspci -nnk | grep -iA2 vga returns:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF119 [GeForce GT 610] [10de:104a] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:098e]
Kernel driver in use: nvidia

lspci -nnk | grep -iA2 3d returns:

02:00.0 3D controller [0302]: NVIDIA Corporation GF110GL [Tesla C2050 / C2075] [10de:1096] (rev a1)
Kernel driver in use: nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GF110 High Definition Audio Controller [10de:0e09] (rev a1)
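
The two lspci checks above can be folded into a single pass that prints each NVIDIA PCI function together with the kernel driver bound to it, which would make a device still bound to nouveau easy to spot. This is a rough sketch assuming a POSIX shell with awk and the default lspci -nnk layout (slot at the start of each device line, no PCI domain prefix); the function name nvidia_drivers is only illustrative.

```shell
# List "slot driver" for every NVIDIA device (PCI vendor ID 10de).
# Reads `lspci -nnk` output on stdin.
nvidia_drivers() {
    awk '
        # Device header line, e.g. "02:00.0 3D controller [0302]: ... [10de:1096]".
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-9a-f] / {
            # Remember the slot only when the header carries the NVIDIA vendor ID.
            slot = ($0 ~ /\[10de:/) ? $1 : ""
            next
        }
        # Driver line belonging to the most recent NVIDIA header.
        slot != "" && /[Kk]ernel driver in use:/ {
            print slot, $NF
        }
    '
}

# Example use (commented out because it depends on the local hardware):
# lspci -nnk | nvidia_drivers
```

On the output pasted above, this would report nvidia for both 01:00.0 and 02:00.0, and nothing for the audio function 02:00.1, which has no driver line.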

In my first message I forgot to mention that dmesg | grep NVRM also returns:

NVRM: Your system is not currently configured to drive a VGA console
NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
NVRM: requires the use of a text-mode VGA console. Use of other console
NVRM: drivers including, but not limited to, vesafb, may result in
NVRM: corruption and stability problems, and is not supported.

However, this message doesn’t seem to affect the first NVIDIA board (GeForce GT 610), since I can boot into graphical mode without any problems.

Now something interesting has caught my attention:

I did some tests with other NVIDIA driver versions too.

As I wrote before, with NVIDIA driver version 361.42, invoking nvidia-smi from the command prompt shows:

Unable to determine the device handle for gpu 0000:02:00:0: Insufficient Permissions

And invoking nvidia-debugdump -l shows:

No permission to talk to device 0000:02:00:0

However, when using the most recent NVIDIA driver, version 367.44 (for the GeForce GT 610 board) or version 352.99 (for the Tesla C2075 board), nvidia-smi and nvidia-debugdump show nothing at all about the Tesla board (not even that it is attached)! I wonder why an older version shows at least some info about the Tesla board but the most recent versions show nothing…

Best Regards from Japan,

Wilson Pardi Junior