Multi Nvidia GPUs and xorg.conf : How to account for PCI bus (BusId) change?

Hi,


On a CentOS 6.5 system, we have 3 Quadro K2000D cards:

lspci | grep -i vga

08:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K2000D] (rev a1)
10:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K2000D] (rev a1)
29:00.0 VGA compatible controller: NVIDIA Corporation GK107GL [Quadro K2000D] (rev a1)

We have a total of 6 screens connected to these 3 cards. We have tweaked the X configuration file (see xorg.conf below), and everything works fine.

Our problem is that if PCI enumeration changes across reboots, the PCI bus numbers allocated to the GPU cards may change. Our configuration would then probably no longer work, as the PCI bus numbers (BusID) are hard-coded in the xorg.conf file.

Is there a solution to dynamically adapt to PCI bus changes while keeping our screen layout?
Thanks a lot,
-jf simon

# nvidia-settings: X configuration file generated by nvidia-settings
# nvidia-settings: version 331.49 (buildmeister@swio-display-x86-rhel47-10) Wed Feb 12 20:59:53 PST 2014

Section "ServerLayout"
    Identifier "Layout0"
    Screen 0 "Screen0" 0 0
    Screen 1 "Screen1" RightOf "Screen0"
    Screen 2 "Screen2" RightOf "Screen1"
    Screen 3 "Screen3" RightOf "Screen2"
    Screen 4 "Screen4" RightOf "Screen3"
    Screen 5 "Screen5" RightOf "Screen4"
    InputDevice "Keyboard0" "CoreKeyboard"
    InputDevice "Mouse0" "CorePointer"
    Option "Xinerama" "0"
    Option "BlankTime" "0"
    Option "StandbyTime" "0"
    Option "SuspendTime" "0"
    Option "OffTime" "0"
EndSection

Section "Files"
    FontPath "/usr/share/fonts/default/Type1"
EndSection

Section "InputDevice"
    # generated from default
    Identifier "Mouse0"
    Driver "mouse"
    Option "Protocol" "auto"
    Option "Device" "/dev/input/mice"
    Option "Emulate3Buttons" "no"
    Option "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from data in "/etc/sysconfig/keyboard"
    Identifier "Keyboard0"
    Driver "keyboard"
    Option "XkbLayout" "us"
    Option "XkbModel" "pc105"
EndSection

Section "Monitor"
    # HorizSync source: edid, VertRefresh source: edid
    Identifier "Monitor0"
    VendorName "Unknown"
    ModelName "Acer S231HL"
    HorizSync 30.0 - 80.0
    VertRefresh 55.0 - 75.0
    Option "DPMS"
EndSection

Section "Monitor"
    # HorizSync source: edid, VertRefresh source: edid
    Identifier "Monitor1"
    VendorName "Unknown"
    ModelName "PRT LCD 2008W"
    HorizSync 31.0 - 83.0
    VertRefresh 56.0 - 75.0
    Option "DPMS"
EndSection

Section "Monitor"
    # HorizSync source: unknown, VertRefresh source: unknown
    Identifier "Monitor2"
    VendorName "Unknown"
    ModelName "ViewSonic VA912b"
    HorizSync 0.0 - 0.0
    VertRefresh 0.0
    Option "DPMS"
EndSection

Section "Monitor"
    # HorizSync source: unknown, VertRefresh source: unknown
    Identifier "Monitor3"
    VendorName "Unknown"
    ModelName "Ancor Communications Inc ASUS MM17T"
    HorizSync 0.0 - 0.0
    VertRefresh 0.0
    Option "DPMS"
EndSection

Section "Monitor"
    # HorizSync source: unknown, VertRefresh source: unknown
    Identifier "Monitor4"
    VendorName "Unknown"
    ModelName "Samsung SME2020"
    HorizSync 0.0 - 0.0
    VertRefresh 0.0
    Option "DPMS"
EndSection

Section "Monitor"
    # HorizSync source: unknown, VertRefresh source: unknown
    Identifier "Monitor5"
    VendorName "Unknown"
    ModelName "Samsung SyncMaster"
    HorizSync 0.0 - 0.0
    VertRefresh 0.0
    Option "DPMS"
EndSection

Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:8:0:0"
    Screen 0
EndSection

Section "Device"
    Identifier "Device1"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:8:0:0"
    Screen 1
EndSection

Section "Device"
    Identifier "Device2"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:16:0:0"
    Screen 0
EndSection

Section "Device"
    Identifier "Device3"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:16:0:0"
    Screen 1
EndSection

Section "Device"
    Identifier "Device4"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:41:0:0"
    Screen 0
EndSection

Section "Device"
    Identifier "Device5"
    Driver "nvidia"
    Option "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerDefault=0x1; PowerMizerDefaultAC=0x1"
    VendorName "NVIDIA Corporation"
    BoardName "Quadro K2000D"
    BusID "PCI:41:0:0"
    Screen 1
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-I-1: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Screen"
    Identifier "Screen1"
    Device "Device1"
    Monitor "Monitor1"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-D-0: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Screen"
    Identifier "Screen2"
    Device "Device2"
    Monitor "Monitor2"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-I-1: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Screen"
    Identifier "Screen3"
    Device "Device3"
    Monitor "Monitor3"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-D-0: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Screen"
    Identifier "Screen4"
    Device "Device4"
    Monitor "Monitor4"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-I-1: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Screen"
    Identifier "Screen5"
    Device "Device5"
    Monitor "Monitor5"
    DefaultDepth 24
    Option "Stereo" "0"
    Option "metamodes" "DVI-D-0: nvidia-auto-select +0+0"
    Option "SLI" "Off"
    Option "MultiGPU" "Off"
    Option "BaseMosaic" "off"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

It’s awfully strange that your device enumeration is changing. That sounds like a system BIOS problem. If you run lspci not through grep, do you see other devices appearing or disappearing that might cause the enumeration change?

You can use “nvidia-xconfig --enable-all-gpus --separate-x-screens” to generate a new xorg.conf with the correct BusID entries, but it might not end up with the specific layout you’re looking for.

My first suggestion would be to see if there is a system BIOS update available that might make your bus topology more stable across reboots, or if there is some specific problematic device in the system that is failing to enumerate intermittently, that might be causing the bus IDs of other devices to change.

It is not “awfully strange”, far from it. And we want to be prepared for it.
The above configuration that we ship to customers is quite heavy (6 screens), and customers are likely to add PCI-e expansion cards with varying degrees of topology complexity (sub-bridges, …). I guarantee you that in these circumstances, PCI enumeration will change sooner or later. This will result in a broken Xorg configuration because the BusIDs have changed.

Is there really no clean solution for this?
thx

I doubt there is going to be a clean solution. As Aaron mentioned this is not something that is usually going to change (unless someone is changing hardware). Seems like a very rare case where someone would need to worry about this.

I would probably just hack something up, like a simple bash script that generates xorg.conf on boot from a template using lspci and sed.
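A rough sketch of that idea, written in Python rather than bash and sed. It relies on one detail visible in the original post: lspci prints bus numbers in hexadecimal (08, 10, 29) while the xorg.conf BusID entries are decimal (8, 16, 41). The `@BUSID0@`-style placeholders, the function names, and the assumption that the cards keep the same relative order on the bus are all hypothetical, not something the thread confirms.

```python
import re

# Start of a plain `lspci` line such as
# "08:00.0 VGA compatible controller: NVIDIA Corporation ...".
# Assumption: no PCI domain prefix, as in the output quoted in the first post.
LSPCI_VGA = re.compile(
    r"^(?P<bus>[0-9a-fA-F]{2}):(?P<dev>[0-9a-fA-F]{2})\.(?P<fn>[0-7])"
    r"\s+VGA compatible controller:.*NVIDIA",
    re.MULTILINE,
)


def bus_ids_from_lspci(lspci_output):
    """Return xorg.conf BusID strings, in lspci order, for NVIDIA VGA devices."""
    ids = []
    for m in LSPCI_VGA.finditer(lspci_output):
        # lspci is hexadecimal; xorg.conf expects decimal.
        bus, dev, fn = (int(m.group(k), 16) for k in ("bus", "dev", "fn"))
        ids.append(f"PCI:{bus}:{dev}:{fn}")
    return ids


def render_template(template, bus_ids):
    """Replace hypothetical @BUSIDn@ placeholders with the n-th detected BusID."""
    for i, bus_id in enumerate(bus_ids):
        template = template.replace(f"@BUSID{i}@", bus_id)
    return template
```

At boot, before X starts, the output of `subprocess.getoutput("lspci")` could be passed to `bus_ids_from_lspci` and the rendered template written to xorg.conf. This only preserves the layout if the first, second and third card on the bus are still the same physical cards.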

Is there a way to use the GPU UUID, e.g. via ~/.nvidia-settings-rc? We can query it with many different tools, but it is not clear how to use it:

$ nvidia-xconfig --query-gpu-info
Number of GPUs: 1

GPU #0:
  Name      : GeForce GTX 750 Ti
  UUID      : GPU-166d5f95-13dc-1d24-87e1-cb59a50b3832
  PCI BusID : PCI:2:0:0

$ nvidia-settings -q gpus

1 GPU on Tippawaara12:0

    [0] Tippawaara12:0[gpu:0] (GeForce GTX 750 Ti)

      Has the following names:
        GPU-0
        GPU-166d5f95-13dc-1d24-87e1-cb59a50b3832
$ nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-166d5f95-13dc-1d24-87e1-cb59a50b3832)

It would be helpful if NVIDIA could add, in future drivers, an X config option such as gpu_UUID “” (or similar) in xorg.conf that identifies a GPU by its UUID rather than by its BusID. That would help in case of PCI enumeration changes (just as we now use partition UUIDs in /etc/fstab instead of drive device names).
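Until such an option exists, one workaround is to translate UUIDs into BusIDs yourself. A minimal sketch, assuming the `nvidia-xconfig --query-gpu-info` output keeps the shape quoted above (a `UUID` line directly followed by a `PCI BusID` line); the function name is made up for illustration.

```python
import re

# A "UUID : ..." line immediately followed by a "PCI BusID : ..." line,
# as in the nvidia-xconfig --query-gpu-info output quoted above.
GPU_ENTRY = re.compile(
    r"UUID\s*:\s*(?P<uuid>GPU-[0-9a-fA-F-]+)\s+"
    r"PCI BusID\s*:\s*(?P<busid>PCI:\d+:\d+:\d+)"
)


def uuid_to_bus_id(query_output):
    """Map each GPU UUID to the BusID string reported next to it."""
    return {m.group("uuid"): m.group("busid") for m in GPU_ENTRY.finditer(query_output)}
```

A generator script could then keep UUIDs in its template and look up the current BusID for each one just before writing xorg.conf.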

I have a solution (at least a partial one). I have 4 flat panels across 2 cards, with no SLI bridge.
I needed a single X screen spanning all 4 monitors, with the proprietary drivers installed. Xinerama would not work for me: it still wanted to instantiate 2 X screens no matter what I did. Instead I used the grossly under-documented Mosaic feature and did away with Xinerama altogether. (The other reason I needed just one X screen is that XRandR stops working with my setup when using Xinerama, even though it’s not supposed to, and the Cinnamon desktop on Mint 17.1 dies when there is more than one X screen.) Mosaic fixed all of this for me, and I’m happy as a clam. Here’s my xorg.conf; please modify it to suit your needs. It looks weird because there is only one device and only one screen listed, but that’s just the way it works. Everything is handled with the metamodes.

Here’s how to make it work for you: delete all of the GPU-“UUID” parts and leave everything else. (Please mind that some cards have unused internal connectors, so yours might skip a number like mine did.) Run nvidia-settings and you should see “enable Xinerama” changed to “enable base-mosaic”. If you see this, go ahead and save your xorg.conf, overwriting instead of merging. You’ll have a freshly generated xorg.conf with all the GPU UUIDs listed in the metamodes. Then just use nvidia-settings to move your monitors around and you should be golden.

# nvidia-settings: X configuration file generated by nvidia-settings
# nvidia-settings:  version 331.20  (buildd@roseapple)  Mon Feb  3 15:07:22 UTC 2014

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
    Option         "Xinerama" "0"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    # HorizSync source: edid, VertRefresh source: edid
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "DELL P2414H"
    HorizSync       30.0 - 83.0
    VertRefresh     56.0 - 76.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Quadro K600"
    BusID          "PCI:3:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Stereo" "0"
    Option         "metamodes" "GPU-26a97366-bbca-8011-49a6-41afa4b5d1a4.GPU-0.DVI-I-1: 1920x1080_60 +2160+0 {rotation=left}, GPU-26a97366-bbca-8011-49a6-41afa4b5d1a4.GPU-0.DP-1: 1920x1080_60 +0+0 {rotation=left}, GPU-106bac86-12ef-254d-2203-820f4666527f.GPU-1.DVI-I-1: 1920x1080_60 +3240+0 {rotation=left}, GPU-106bac86-12ef-254d-2203-820f4666527f.GPU-1.DP-1: 1920x1080_60 +1080+0 {rotation=left}"
    Option         "MultiGPU" "Off"
    Option         "SLI" "off"
    Option         "BaseMosaic" "on"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

I don’t know if the BusID actually matters, as long as one of the BusIDs is there. For example, the second card in my system is PCI:4, but I don’t actually have to list it.

Unstable PCI identifiers? Do you mean how udev has a possibility of changing device IDs per reboot? They should be static across reboots even with hardware changes because udev populates the PCI slots and it will always populate them in the same order. So if you deploy the same motherboard you can safely expect the same PCI identifiers.

This might not hold true between motherboard revisions or firmware revisions. You may have to test each configuration or just deploy the same stuff every time.

IDK if this still is an actual problem, but I probably have a solution.

First of all, it’s really common for GPU numbers to change after adding or removing a device. When I remove one of the six cards from my setup, the enumeration of the rest becomes completely different, and things like per-GPU overclocking become risky.

I solved the problem for my case: automated overclocking based on UUIDs and settings from an INI config, which looks like:

[ed171a40-d848-8676-9396-13b34cf98aff]
GPUTargetFanSpeed = 70
GPUGraphicsClockOffset[3] = -200
GPUMemoryTransferRateOffset[3] = 1700
...

[ac55e3ba-da65-96c4-03a2-94709ccc3fdc]
GPUTargetFanSpeed = 65
...
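An INI file shaped like this can be read with Python's standard `configparser`. The sketch below is only an illustration and not the poster's script; the function name is made up. The literal `...` lines in the sample are elisions and would have to be removed or replaced with real settings before parsing.

```python
import configparser


def load_gpu_settings(ini_text):
    """Return {uuid: {attribute: value}} from INI text with one section per GPU UUID."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep attribute names such as "GPUGraphicsClockOffset[3]" case-sensitive;
    # by default configparser lowercases option names.
    parser.optionxform = str
    parser.read_string(ini_text)
    return {uuid: dict(parser[uuid]) for uuid in parser.sections()}
```

All values come back as strings, which is convenient if they are only going to be pasted into a command line.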

The solution is a simple Python script with a regular expression that grabs the real UUIDs and checks the config for the corresponding settings.

The part you’re interested in is:

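import re
import subprocess
import sys

# Ask the driver for its GPU list; each GPU is reported with both an
# index name (GPU-0) and a UUID name (GPU-xxxxxxxx-xxxx-...).
gpus_info = subprocess.getoutput("nvidia-settings -q gpus")
matches = re.findall(
    r"GPU-(?P<number>\d+)"
    r"\s+?"
    r"GPU-(?P<uuid>\w{8}-\w{4}-\w{4}-\w{4}-\w{12})",
    gpus_info,
)
# Map each UUID to the GPU number it currently has.
cards = {uuid: number for number, uuid in matches} if matches else None

if cards is None:
    print("Cannot detect any cards via 'nvidia-settings -q gpus'")
    sys.exit(1)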

As you can see, the “cards” dict will contain pairs like (gpu_uuid → gpu_number). Now, xorg.conf has parts like the following:

Section "Device"
    Identifier     "Device0"

where the number in “Device0” is the above-mentioned “gpu_number”.

So, the working solution for your problem could be:

  1. Hardcode or generate an enumerated setup with UUIDs and their corresponding GPU numbers.
  2. Write a script that checks the real situation (nvidia-settings -q gpus OR nvidia-xconfig --query-gpu-info) on each start-up (a systemd service is required).
  3. Compare the situation to the one from the previously generated setup.
  4. If needed (there were some changes in enumeration), regenerate the setup, regenerate xorg.conf, and ask for a reboot.

Actually, if you ditch UUIDs and start tracking PCI IDs instead, the separate setup becomes redundant: you can get all the information from xorg.conf.
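Steps 1, 3 and 4 could be sketched roughly as follows, assuming the saved setup is a JSON file holding a UUID-to-GPU-number mapping. The file format and all function names here are assumptions made for illustration, not part of the poster's script.

```python
import json


def load_saved(path):
    """Load the previously saved {uuid: gpu_number} setup, or {} on first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_setup(path, cards):
    """Persist the current {uuid: gpu_number} mapping for the next boot."""
    with open(path, "w") as f:
        json.dump(cards, f, indent=2, sort_keys=True)


def changed_uuids(saved, current):
    """Return the sorted UUIDs that are new, renumbered, or no longer present."""
    moved_or_new = {u for u, n in current.items() if saved.get(u) != n}
    missing = set(saved) - set(current)
    return sorted(moved_or_new | missing)
```

If `changed_uuids` returns a non-empty list, the start-up script would regenerate xorg.conf, call `save_setup`, and ask for a reboot; otherwise it does nothing.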

No, he means that when you add a new GPU, the enumeration of the existing ones can change (and it will), while xorg.conf and other tools stick to the old enumeration. In my case, some card would get the wrong overclocking parameters because of commands like:

nvidia-settings -a [gpu:<NUMBER>]/GPUMemoryTransferRateOffset[3]=1900
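One way to avoid hard-coding the number is to fill it in from a UUID lookup at run time. The sketch below assumes a `cards` dict of the form (gpu_uuid → gpu_number) and a per-UUID settings dict; the function name is made up, and it only builds argument lists without executing anything.

```python
def build_commands(cards, settings):
    """Build nvidia-settings argument lists, resolving each UUID to its current GPU number.

    cards:    {uuid: gpu_number} as detected on this boot
    settings: {uuid: {attribute: value}} keyed by the same UUIDs
    """
    commands = []
    for uuid, attrs in settings.items():
        number = cards.get(uuid)
        if number is None:
            continue  # this card is not installed right now; skip its settings
        for attr, value in attrs.items():
            commands.append(["nvidia-settings", "-a", f"[gpu:{number}]/{attr}={value}"])
    return commands
```

Each returned list can be handed to `subprocess.run`, so the offsets follow the card by UUID even when its GPU number changes.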