P100 Issues on EL6/7 - /proc/driver/nvidia/gpus/XX/information output is ?? and can't run X
Hi, we've been trying various driver versions (both RPM and .run) on our EL6/7 Dell R740xds with no success. I'll paste a bunch of output below but ultimately it seems as if the driver's half working. The card is detected, but there's a lot of output that doesn't make sense and X won't load.

cat /proc/driver/nvidia/gpus/0000:3b:00.0/information 
Model: Tesla P100-PCIE-12GB
IRQ: 324
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:3b:00.0
Device Minor: 0


dmesg -T | grep -i -e nvidia -e nvrm
[Tue Dec 5 12:19:34 2017] nvidia: loading out-of-tree module taints kernel.
[Tue Dec 5 12:19:34 2017] nvidia: module license 'NVIDIA' taints kernel.
[Tue Dec 5 12:19:34 2017] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Tue Dec 5 12:19:34 2017] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[Tue Dec 5 12:19:34 2017] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.98 Thu Oct 26 15:16:01 PDT 2017 (using threaded interrupts)
[Tue Dec 5 12:19:34 2017] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 384.98 Thu Oct 26 14:41:13 PDT 2017
[Tue Dec 5 12:19:35 2017] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[Tue Dec 5 12:20:15 2017] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 238
[Tue Dec 5 12:20:15 2017] nvidia 0000:3b:00.0: irq 324 for MSI/MSI-X


From Xorg.0.log
[    47.728] (II) NVIDIA dlloader X Driver  384.98  Thu Oct 26 14:06:45 PDT 2017
[ 47.728] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 47.728] (++) using VT number 1

[ 47.730] (EE) No devices detected.
[ 47.730] (EE)
Fatal server error:
[ 47.730] (EE) no screens found(EE)
[ 47.730] (EE)


I've tried 375, 381 and 384 drivers. I've also updated the R740xd to the latest BIOS available and run the NVIDIA Firmware Update Utility (v5.402.0) from Dell's support site. I've tried using the version of the driver downloaded from both Dell's support site and from NVIDIA's site directly.

Any help would be really appreciated.

#1
Posted 12/05/2017 09:01 PM   
Another weird thing - nvidia-smi seems to work:

nvidia-smi -L
GPU 0: Tesla P100-PCIE-12GB (UUID: GPU-b16a8955-5b72-9299-36e8-6ccbf0ccc448)


nvidia-smi -q
==============NVSMI LOG==============

Timestamp : Tue Dec 5 13:08:12 2017
Driver Version : 384.81

Attached GPUs : 1
GPU 00000000:3B:00.0
Product Name : Tesla P100-PCIE-12GB
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322517077813
GPU UUID : GPU-b16a8955-5b72-9299-36e8-6ccbf0ccc448
Minor Number : 0
VBIOS Version : 86.00.41.00.07
MultiGPU Board : No
Board ID : 0x3b00
GPU Part Number : 900-2H400-0110-030
Inforom Version
Image Version : H400.0202.00.01
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x3B
Device : 0x00
Domain : 0x0000
Device Id : 0x15F710DE
Bus Id : 00000000:3B:00.0
Sub System Id : 0x11DA10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 1000 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 12193 MiB
Used : 0 MiB
Free : 12193 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 27.26 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 1189 MHz
SM : 1189 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None


It's probably worth elaborating a bit on what we're doing. We use these boxes for remote X sessions via NoMachine. The xorg.conf looks like this:

Section "DRI"
Mode 0666
EndSection

Section "Files"
ModulePath "/usr/lib64/xorg/modules/extensions/nvidia"
ModulePath "/usr/lib64/xorg/modules"
FontPath "/usr/share/fonts/default/Type1"
EndSection

Section "ServerFlags"
Option "IgnoreABI" "True"
Option "nolisten" "True"
Option "AutoAddDevices" "False"
EndSection

Section "ServerLayout"
Identifier "layout"
Screen 0 "nvidia" 0 0
EndSection

Section "Device"
Identifier "nvidia"
Driver "nvidia"
BusID "3b:0:0"
EndSection

Section "Screen"
Identifier "nvidia"
Device "nvidia"
Option "UseDisplayDevice" "none"
Option "Overlay" "True"
Option "UseEvents" "False"
EndSection

Section "Extensions"
Option "Composite" "Disable"
EndSection


Since nvidia-smi was working, I wondered whether it was only the /proc output that was off, but with X still failing to start I'm not sure where to go next.

#2
Posted 12/05/2017 09:23 PM   
Answer Accepted by Original Poster
Without a monitor attached, you have to add
Option "AllowEmptyInitialConfiguration"
to the Device section of xorg.conf.
Also, the PCI BusID in xorg.conf is decimal, not hexadecimal: 0x3b = 59, so the bus number should be 59 rather than 3b.
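
For reference, the hex-to-decimal conversion can be scripted. The following is a minimal Python sketch, not anything shipped with the driver: the function name `pci_to_xorg_busid` is made up for illustration, and it assumes the input is an lspci-style address such as `0000:3b:00.0`. It emits the `PCI:bus:device:function` form described in the xorg.conf man page; the bare `59:0:0` form is what ended up being used later in this thread.

```python
def pci_to_xorg_busid(pci_address: str) -> str:
    """Convert a hex PCI address (as shown by lspci or /proc) to an xorg.conf BusID.

    Hypothetical helper for illustration only.
    """
    # Accept "0000:3b:00.0" (with PCI domain) or "3b:00.0" (without).
    parts = pci_address.strip().split(":")
    if len(parts) == 3:
        _domain, bus, dev_func = parts
    elif len(parts) == 2:
        bus, dev_func = parts
    else:
        raise ValueError(f"unrecognized PCI address: {pci_address!r}")

    device, function = dev_func.split(".")

    # lspci prints every field in hexadecimal; xorg.conf expects decimal.
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"


print(pci_to_xorg_busid("0000:3b:00.0"))  # PCI:59:0:0
```

The domain field is parsed but ignored here, which is only reasonable on a single-domain system like the one in this thread.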

#3
Posted 12/05/2017 11:23 PM   
Thank you. I'm still not quite in (it looks like some NoMachine issues), but I notice that after correcting the Bus ID and adding that option, the /proc/driver output is now correct.

Does X have to be running for the contents of /proc/driver/nvidia to be complete and correct?

I'll fight with NoMachine a bit further and hopefully reply shortly to say all is well.

#4
Posted 12/05/2017 11:55 PM   
A couple of tweaks were needed to get NoMachine to give me a desktop session, but it is now opening up. Thank you very much, generix.

For anyone else's reference...

Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
EndSection

Section "Files"
ModulePath "/usr/lib64/xorg/modules/extensions/nvidia"
ModulePath "/usr/lib64/xorg/modules"
FontPath "/usr/share/fonts/default/Type1"
EndSection

Section "ServerFlags"
Option "IgnoreABI" "True"
Option "nolisten" "True"
Option "AutoAddDevices" "False"
EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 30.0 - 110.0
VertRefresh 50.0 - 150.0
Option "DPMS"
EndSection

Section "Device"
Identifier "Device0"
Driver "nvidia"
BusID "59:0:0"
Option "AllowEmptyInitialConfiguration"
EndSection

Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
Option "UseEvents" "False"
EndSection

Section "DRI"
Mode 0666
EndSection

Section "Extensions"
Option "Composite" "Disable"
EndSection

#5
Posted 12/06/2017 12:04 AM   
Just to clarify for the first few posts: This is expected behavior. The proc interface is created as soon as the nvidia.ko kernel module is loaded, but some of the data isn't queried from the GPU until the driver is actually initialized. That's why you'll see the relevant information if something else (such as an X server or nvidia-persistenced) is keeping the /dev/nvidia* devices open, and question marks otherwise. The reason it works with nvidia-smi is that it opens /dev/nvidia*, queries the information, and then closes it.

If you want to keep the GPU initialized all the time even if no other clients are using it, that's what nvidia-persistenced is for.
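
As a rough way to tell whether the GPU is currently initialized, one could scan the `information` file for the question-mark placeholders described above. This is only a sketch under assumptions: the function name `uninitialized_fields` is hypothetical, and the sample text is copied from the first post rather than read from a live system.

```python
from typing import List


def uninitialized_fields(information_text: str) -> List[str]:
    """Return the names of fields whose values still contain '?' placeholders."""
    missing = []
    for line in information_text.splitlines():
        # Split on the first colon only, so values such as
        # "0000:3b:00.0" stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        if "?" in value:
            missing.append(name.strip())
    return missing


# Sample taken from the first post in this thread.
sample = """Model: Tesla P100-PCIE-12GB
IRQ: 324
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
Bus Location: 0000:3b:00.0
"""

print(uninitialized_fields(sample))  # ['GPU UUID', 'Video BIOS']
```

On a real machine the text would come from reading `/proc/driver/nvidia/gpus/0000:3b:00.0/information`; an empty result would suggest that some client (X, nvidia-persistenced, or similar) is holding the device open.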

Aaron Plattner
NVIDIA Linux Graphics

#6
Posted 12/06/2017 01:13 AM   