Multi-GPU Peer to Peer access failing on Tesla K80

Hi-
I am having trouble moving data directly from one GPU to another. Take the simpleP2P sample program that comes in NVIDIA toolkit under NVIDIA_CUDA-7.5_Samples/0_Simple/simpleP2P for example. It is failing with verification errors. It would seem my system meets all the specs for UVA and P2P Access/Transfers but I have not been able to get the two GPU’s in the K80 to transfer data directly to one another. I’m running on 64 bit Ubuntu 14.04.2 OS, I’m using CUDA 7.5 with the driver version 352.39. I have been able to copy data to each GPU from the host just not to each other directly.

Here is the output from simpleP2P program

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K80 (GPU0) supports UVA: Yes
> Tesla K80 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.12GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!

Here is the output of nvidia-smi:

Tue Oct  6 13:56:44 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   36C    P0    57W / 149W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:85:00.0     Off |                    0 |
| N/A   29C    P0    74W / 149W |     22MiB / 11519MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

When I type nvidia-smi topo -m

GPU0    GPU1    CPU Affinity
GPU0     X      PIX     6-11,18-23
GPU1    PIX      X      6-11,18-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Any ideas why everything would seem to be working but the transfer from GPU to GPU?

Is your K80 installed in an OEM system that has been certified for K80? What system is it installed in?

It’s a custom build based on a Supermicro X10DRG-HT motherboard with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz processors

Can you make sure that the latest System BIOS is installed on that Supermicro motherboard? What system BIOS version is installed currently?

I get the following when I run dmidecode | less:

# dmidecode 2.12
SMBIOS 2.8 present.
135 structures occupying 5973 bytes.
Table at 0x000ED8A0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1.0b
        Release Date: 01/09/2015
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 5.6

I’ll check and see if they have a more recent version

There is a newer BIOS, please install it and re-try the test.

Finally got around to updating the BIOS and problem seems to still exist:

# dmidecode 2.12
SMBIOS 2.8 present.
135 structures occupying 5973 bytes.
Table at 0x000ED8A0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1.0c
        Release Date: 05/20/2015
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 5.6

simpleP2P run

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K80 (GPU0) supports UVA: Yes
> Tesla K80 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.12GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!

Well I don’t think I have any other good ideas of what to check.

Could you run

nvidia-smi -a

and paste the results into this thread. Could you also run the following as root:

dmesg |grep NVRM

and paste the results into this thread.

Also if it’s not too much trouble, could you try updating the driver to this one:

[url]http://www.nvidia.com/Download/driverResults.aspx/90279/en-us[/url]

I realize Tesla K80 is not listed in the supported products list, but it will work with your Tesla K80.

I’ve run tests just like the one you have, on Tesla K80 in a OEM certified box (Dell C4130) and it works correctly. In my case, the OS is RHEL 6.2, I’m using CUDA 7.5, and my driver version is 352.41, which I can’t change.

So the remaining possibilities are the difference between your OS (Ubuntu) and mine (RHEL), or a hardware difference (platform) or a defect of some sort (in the platform, or the K80). I probably won’t be able to debug it remotely to further narrow it down.

Sure thing

nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Fri Oct  9 09:41:42 2015
Driver Version                      : 352.39

Attached GPUs                       : 2
GPU 0000:84:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0321215003232
    GPU UUID                        : GPU-8fcde20c-90e0-e971-b1d5-958f1e653d39
    Minor Number                    : 0
    VBIOS Version                   : 80.21.1B.00.01
    MultiGPU Board                  : Yes
    Board ID                        : 0x8200
    Inforom Version
        Image Version               : 2080.0200.00.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102D10DE
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : PLX
            Firmware                : 0xF0472900
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 22 MiB
        Free                        : 11497 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 36 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 58.16 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 562 MHz
        SM                          : 562 MHz
        Memory                      : 2505 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

GPU 0000:85:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0321215003232
    GPU UUID                        : GPU-5e15df18-c47a-e283-25e7-784aa2930f1d
    Minor Number                    : 1
    VBIOS Version                   : 80.21.1B.00.02
    MultiGPU Board                  : Yes
    Board ID                        : 0x8200
    Inforom Version
        Image Version               : 2080.0200.00.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x85
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102D10DE
        Bus Id                      : 0000:85:00.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : PLX
            Firmware                : 0xF0472900
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 22 MiB
        Free                        : 11497 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 84 %
        Memory                      : 4 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 30 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 74.39 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 692 MHz
        SM                          : 692 MHz
        Memory                      : 2505 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

dmesg |grep NVRM

[   10.419771] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  352.39  Fri Aug 14 18:09:10 PDT 2015

Thanks in any case. I will work on trying to update driver and see what happens

I don’t see any other issues. All of the firmware versions in your K80 match mine. There are no errors reported by the driver in the system logs.

I’m a little bit suspicious that there may be a linux kernel setting that is affecting this, but I can’t put my finger on anything specific right now.

OK one more thing to check.

Can you go into the BIOS setup on that system, and see if there is an option to enable/disable PCIE ACS.

If so, the desire is to turn off ACS and see if it affects the behavior.

I cant seem to locate that in the BIOS. I’m not sure this system’s PCIE ports have that capability

If you want to continue, we can do some further investigation, but it will be a stepwise process.

If you want to continue, please run the following command as root, and post the results here:

lspci | grep -i plx

Based on those results, I should be able to give you the next command to run.

(P.S. This sort of activity generally is not necessary if you purchase a K80 in a qualified OEM config.)

Sure:

lspci | grep -i plx
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)

Please run the following commands as root, then post the results here:

lspci -s 83:08.0 -vvvv | grep -i acs

and

lspci -s 83:10.0 -vvvv | grep -i acs
sudo lspci -s 83:08.0 -vvvv | grep -i acs
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
sudo lspci -s 83:10.0 -vvvv | grep -i acs
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans

Would it make a difference if I used two titan black, or two Tesla K40c GPU’s in this system for P2P transfers and access? Is this just an issue of using a K80 with this motherboard?

I apologize for dropping the ball here. Not sure why I didn’t see your last 2 updates.

If you are still pursuing this, I would like to first point out that the only permanent solution/fix would be to get an updated BIOS from the system vendor that fixes this issue. Supermicro is certainly aware of the underlying issue here as they have applied the necessary fix via BIOS update to some of their other GPU-enabled products.

Anyway, to proceed with the process, you would need to disable ACS on the motherboard, and re-test the simpleP2P test (without rebooting). The steps to disable ACS would be:

setpci -s 83:08.0 f2a.w=0000
setpci -s 83:10.0 f2a.w=0000

Note that this will probably require root privilege. You can then re-verify the changed settings by running the previous lspci commands:

sudo lspci -s 83:08.0 -vvvv | grep -i acs
sudo lspci -s 83:10.0 -vvvv | grep -i acs

at which point for each you should see reported the ACSCtl line with all negative settings:

ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

At this point you should re-run the simpleP2P test to see if the verification errors are still reported.
This change is not permanent: a reboot will restore the machine to the previous state.

If this fixes the issue, then you can leave it as-is if you wish, or else provide this information to your system vendor. They can advise you on the availability of a SBIOS with such a fix.

If it does not fix the issue, I am out of ideas, and it is probably best to refer back to your system vendor.

Regarding your additional question, I suspect that if you used two Tesla K40c GPUs, for example, you would not see this issue, and P2P transfers would “just work” but I am just guessing on that. You would have to try it to be sure.

A related supermicro faq entry is here:

http://www.supermicro.com/support/faqs/faq.cfm?faq=20732

Hi,

we are facing similar issues on the M60 and the K80 both on Supermicro. Your suggested fix worked on our M60 GPUs but not on the K80. On the K80, it passes the simpleP2P test but gets stuck on the p2pBandwidthLatencyTest (you see a Kernel oops in dmesg and only a hard reboot fixes the problem) . ACS is turned off, BIOS up-to-date. I’m unsure what else to try…

Any more ideas?

Thanks,
-Florian

Have you updated the system BIOS on the affected platform to the latest available from Supermicro?

If so, then you should probably contact Supermicro to ask for assistance with the K80.