P100 Issues on EL6/7 - /proc/driver/nvidia/gpus/XX/information output is ?? and can't run X

Hi, we’ve been trying various driver versions (both RPM and .run) on our EL6/7 Dell R740xds with no success. I’ll paste a bunch of output below but ultimately it seems as if the driver’s half working. The card is detected, but there’s a lot of output that doesn’t make sense and X won’t load.

cat /proc/driver/nvidia/gpus/0000:3b:00.0/information 
Model: 		 Tesla P100-PCIE-12GB
IRQ:   		 324
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:3b:00.0
Device Minor: 	 0

dmesg -T | grep -i -e nvidia -e nvrm
[Tue Dec  5 12:19:34 2017] nvidia: loading out-of-tree module taints kernel.
[Tue Dec  5 12:19:34 2017] nvidia: module license 'NVIDIA' taints kernel.
[Tue Dec  5 12:19:34 2017] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[Tue Dec  5 12:19:34 2017] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[Tue Dec  5 12:19:34 2017] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  384.98  Thu Oct 26 15:16:01 PDT 2017 (using threaded interrupts)
[Tue Dec  5 12:19:34 2017] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  384.98  Thu Oct 26 14:41:13 PDT 2017
[Tue Dec  5 12:19:35 2017] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[Tue Dec  5 12:20:15 2017] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 238
[Tue Dec  5 12:20:15 2017] nvidia 0000:3b:00.0: irq 324 for MSI/MSI-X

From Xorg.0.log

[    47.728] (II) NVIDIA dlloader X Driver  384.98  Thu Oct 26 14:06:45 PDT 2017
[    47.728] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[    47.728] (++) using VT number 1

[    47.730] (EE) No devices detected.
[    47.730] (EE) 
Fatal server error:
[    47.730] (EE) no screens found(EE) 
[    47.730] (EE)

I’ve tried 375, 381 and 384 drivers. I’ve also updated the R740xd to the latest BIOS available and run the NVIDIA Firmware Update Utility (v5.402.0) from Dell’s support site. I’ve tried using the version of the driver downloaded from both Dell’s support site and from NVIDIA’s site directly.

Any help would be really appreciated.

Another weird thing - nvidia-smi seems to work:

nvidia-smi -L
GPU 0: Tesla P100-PCIE-12GB (UUID: GPU-b16a8955-5b72-9299-36e8-6ccbf0ccc448)

nvidia-smi -q
==============NVSMI LOG==============

Timestamp                           : Tue Dec  5 13:08:12 2017
Driver Version                      : 384.81

Attached GPUs                       : 1
GPU 00000000:3B:00.0
    Product Name                    : Tesla P100-PCIE-12GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322517077813
    GPU UUID                        : GPU-b16a8955-5b72-9299-36e8-6ccbf0ccc448
    Minor Number                    : 0
    VBIOS Version                   : 86.00.41.00.07
    MultiGPU Board                  : No
    Board ID                        : 0x3b00
    GPU Part Number                 : 900-2H400-0110-030
    Inforom Version
        Image Version               : H400.0202.00.01
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x3B
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F710DE
        Bus Id                      : 00000000:3B:00.0
        Sub System Id               : 0x11DA10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 1000 KB/s
        Rx Throughput               : 1000 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
    FB Memory Usage
        Total                       : 12193 MiB
        Used                        : 0 MiB
        Free                        : 12193 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                CBU                 : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                CBU                 : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                CBU                 : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 26 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 27.26 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1189 MHz
        SM                          : 1189 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 715 MHz
        Video                       : 1328 MHz
    Max Customer Boost Clocks
        Graphics                    : 1328 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

Probably worth elaborating a bit on what we’re doing. We use these boxes for remote X sessions using NoMachine. The xorg.conf looks like this:

Section "DRI"
	Mode 0666
EndSection

Section "Files"
	ModulePath   "/usr/lib64/xorg/modules/extensions/nvidia"
	ModulePath   "/usr/lib64/xorg/modules"
	FontPath     "/usr/share/fonts/default/Type1"
EndSection

Section "ServerFlags"
	Option "IgnoreABI" "True"
	Option "nolisten" "True"
	Option "AutoAddDevices" "False"
EndSection

Section "ServerLayout"
	Identifier "layout"
	Screen 0 "nvidia" 0 0
EndSection

Section "Device"
	Identifier "nvidia"
	Driver "nvidia"
	BusID "3b:0:0"
EndSection

Section "Screen"
	Identifier "nvidia"
	Device "nvidia"
	Option "UseDisplayDevice" "none"
	Option "Overlay" "True"
	Option "UseEvents" "False"
EndSection

Section "Extensions"
	Option "Composite" "Disable"
EndSection

Since nvidia-smi was working, I wondered whether there was just something funky going on with the /proc output, but with X still not happy I’m not sure where to go next.

Without a monitor attached, you have to add
Option "AllowEmptyInitialConfiguration"
to the Device section of xorg.conf.
Also, the PCI BusID in xorg.conf is decimal, not hexadecimal: 0x3b = 59, so it should be "59:0:0".
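
The hex-to-decimal conversion mentioned above is easy to get wrong by hand, so here is a minimal shell sketch of it. The function name pci_to_xorg_busid is made up for illustration and is not part of any NVIDIA or Xorg tool. It takes an address as printed by lspci or /proc/driver/nvidia (hexadecimal, with or without the leading domain) and prints it in the "PCI:bus:device:function" form described in the xorg.conf man page. The config later in this thread uses the bare "59:0:0" form instead.

```shell
# Hypothetical helper: convert a hexadecimal PCI address such as
# 0000:3b:00.0 (or 3b:00.0) into a decimal xorg.conf BusID.
# The body runs in a subshell, so its variables do not leak into the caller.
pci_to_xorg_busid() (
    addr="$1"
    # Drop an optional leading domain ("0000:") so "bus:device.function" remains.
    case "$addr" in
        *:*:*) addr="${addr#*:}" ;;
    esac
    bus="${addr%%:*}"                      # text before the first colon
    dev="${addr#*:}"; dev="${dev%%.*}"     # text between the colon and the dot
    fn="${addr##*.}"                       # text after the dot
    # printf %d accepts a 0x prefix and prints the value in decimal.
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
)

pci_to_xorg_busid 0000:3b:00.0   # prints PCI:59:0:0
```

For the card in this thread, 0000:3b:00.0 becomes bus 59, device 0, function 0.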

Thank you. I’m still not quite in (seems like some NoMachine issues), but I do notice that after correcting the Bus ID and adding that option, the /proc/driver output is now correct.

Should X have to be running for the contents of /proc/driver/nvidia to be complete and correct?

I’ll fight with NoMachine a bit further and hopefully reply shortly saying all’s well.

It took a couple of tweaks to get NoMachine to let me have a desktop session, but it is now opening up. Thank you very much, generix.

For anyone else’s reference, here is the working xorg.conf:

Section "ServerLayout"
	Identifier     "Layout0"
	Screen      0  "Screen0" 0 0
EndSection

Section "Files"
	ModulePath   "/usr/lib64/xorg/modules/extensions/nvidia"
	ModulePath   "/usr/lib64/xorg/modules"
	FontPath     "/usr/share/fonts/default/Type1"
EndSection

Section "ServerFlags"
	Option	    "IgnoreABI" "True"
	Option	    "nolisten" "True"
	Option	    "AutoAddDevices" "False"
EndSection

Section "Monitor"
	Identifier   "Monitor0"
	VendorName   "Unknown"
	ModelName    "Unknown"
	HorizSync    30.0 - 110.0
	VertRefresh  50.0 - 150.0
	Option	    "DPMS"
EndSection

Section "Device"
	Identifier  "Device0"
	Driver      "nvidia"
	BusID       "59:0:0"
	Option      "AllowEmptyInitialConfiguration"
EndSection

Section "Screen"
	Identifier "Screen0"
	Device     "Device0"
	Monitor    "Monitor0"
	Option	   "UseEvents" "False"
EndSection

Section "DRI"
	Mode         0666
EndSection

Section "Extensions"
	Option	    "Composite" "Disable"
EndSection

Just to clarify for the first few posts: This is expected behavior. The proc interface is created as soon as the nvidia.ko kernel module is loaded, but some of the data isn’t queried from the GPU until the driver is actually initialized. That’s why you’ll see the relevant information if something else (such as an X server or nvidia-persistenced) is keeping the /dev/nvidia* devices open, and question marks otherwise. The reason it works with nvidia-smi is that it opens /dev/nvidia*, queries the information, and then closes it.

If you want to keep the GPU initialized all the time even if no other clients are using it, that’s what nvidia-persistenced is for.
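
To see which of the two states a GPU is in, one option is to look for the question-mark placeholders in the information file. The sketch below assumes the "GPU UUID:" line format shown in the first post. The function name gpu_info_state is hypothetical, and checking only the UUID line is a simplification.

```shell
# Hypothetical helper: report whether an "information" file under
# /proc/driver/nvidia/gpus/<address>/ still shows placeholder values.
gpu_info_state() {
    # A UUID made of question marks means the driver has not queried the GPU yet.
    if grep -q '^GPU UUID:.*[?]' "$1"; then
        echo "uninitialized"
    else
        echo "initialized"
    fi
}

# Check every GPU the kernel module has registered.
for f in /proc/driver/nvidia/gpus/*/information; do
    [ -r "$f" ] || continue   # module not loaded, or the glob matched nothing
    printf '%s: %s\n' "$f" "$(gpu_info_state "$f")"
done
```

Going by the explanation above, this should report "uninitialized" while nothing holds /dev/nvidia* open, and "initialized" once an X server or nvidia-persistenced is running.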