Titan X Pascal scaling with 4 cards ... problems?

Hi All,

I got to do a quick test on a system with 4 new Titan X cards. (Motherboard is ASUS X99-E WS with PLX chip for 4 x X16 …)

Setup with CUDA 7.5 and 8.0rc on Ubuntu 16.04 and driver 367.35

For a first test I used the nbody code from the samples. This usually scales OK on multiple GPUs, but it is very odd on the new Titan X (similar results with CUDA 7.5 and 8).

Using nbody -benchmark -numbodies=256000 -numdevices= …
Each card individually gives around 7900 GFLOP/s as expected but multiple cards do this:
(1) 7828 GFLOP/s
(2) 7125
(3) 13500
(4) 14689

I have had results as low as 1200 GFLOP/s using 2 cards! It’s inconsistent. I have used different sets of 4 cards too to rule out flaky cards. (These results are with all 4 cards installed)
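
In case anyone wants to reproduce this, here is the kind of sweep I'm running (just a sketch, assuming the nbody binary built from the CUDA samples is in the current directory):

# run the benchmark on 1, 2, 3 and 4 GPUs and pull out the GFLOP/s line
for n in 1 2 3 4; do
  echo "=== $n device(s) ==="
  ./nbody -benchmark -numbodies=256000 -numdevices=$n | grep GFLOP
done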

1080s worked fine on the same board and install:

(1) 5441 GFLOP/s
(2) 8202
(3) 13060
(4) 16239

I just wanted to put this out there as a heads-up that there may be problems. I suspect it may be a driver issue???

Everything I have done with 1070s and 1080s has been trouble free and with fantastic performance.

… updated to driver 367.44 … no change

I just posted what I believe is the same issue here:

https://github.com/NVIDIA/nccl/issues/44

I just updated to the beta 370 driver and it appears to be fixed. Doing more extensive testing now.

yes, could be the same issue … I’ll try 370 too and see what happens … nope, didn’t help here.

I installed the 370.23 driver and rebuilt with both CUDA 7.5 and 8.

CUDA 8 build gave this:
(1)
1 Devices used for simulation
= 7440.199 single-precision GFLOP/s at 20 flops per interaction
(2)
2 Devices used for simulation
= 4721.444 single-precision GFLOP/s at 20 flops per interaction
(3)
3 Devices used for simulation
= 7935.839 single-precision GFLOP/s at 20 flops per interaction
(4)
4 Devices used for simulation
= 11814.144 single-precision GFLOP/s at 20 flops per interaction

Not so good … puzzling!

Thanks for your input --Don

… I have booted up a dual socket machine with 4 full X16 slots (and no PLX chip) … results are better but still very strange when running on just 2 of the 4 cards.

nbody -benchmark -numbodies=256000 -numdevices= …

(1) 7507 GFLOP/s
(2) 9096
(3) 15130
(4) 21706

With only 2 cards in the system, the 2-GPU result is still poor:
(1) 7568
(2) 9347
–Don

Whenever dual-socket systems are used, make sure to carefully control memory and CPU affinity (e.g. with numactl), so each GPU talks to the “near” CPU and “near” system memory.
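
For example, something along these lines (just a sketch; it assumes GPUs 0 and 1 are attached to socket 0, which you would need to verify for your board):

# bind CPU threads and memory allocations to NUMA node 0 and use only the GPUs on that socket
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 --membind=0 ./nbody -benchmark -numbodies=256000 -numdevices=2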

Note that not all CPUs can provide the >= 32 PCIe lanes that are needed to drive two GPUs from one socket at full PCIe gen3 x16 rates. What CPU(s) are you using?
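
One quick way to check what link each card has actually negotiated (field names as supported by recent nvidia-smi versions; check nvidia-smi --help-query-gpu if they differ on your driver):

# report current PCIe generation and link width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv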

Depending on the intensity of memory traffic between GPUs and the system, system memory could possibly put the brakes on, so it would probably be best to use a fast DDR4 configuration. Two Titan X cards coupled to one CPU socket could provide up to 50 GB/sec of memory traffic when operating at maximum full-duplex throughput. I am speaking theoretically here as I don’t have hands-on experience with dual-socket machines with four Titan Xs.
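
Rough arithmetic behind that figure: PCIe gen3 x16 is good for roughly 12.5 GB/sec of usable bandwidth per direction, so 2 GPUs x 2 directions x ~12.5 GB/sec ≈ 50 GB/sec of potential system memory traffic.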

…yes, … but this is an ASUS D8 WS board with full X16 support for 4 cards, 2 x E5-2690 v4 CPUs, etc. …
This simple sample code should work fine. I mainly wanted to see if there was some issue specific to the ASUS X99-E WS board, since it is sometimes problematic (and it is the default DIGITS box motherboard).

The biggest puzzle is this: Why is scaling so bad with 2 TitanX Pascal cards?

I have tried 1, 2 and 4 card setups on good single and dual socket MB’s with GTX 1070, 1080 and TitanX (Pascal) cards. The 1070 and 1080 scale as expected but the TitanX is very odd and inconsistent.

I’m mostly putting this out there so people can find it if they are running into this. Hopefully someone better versed and with more time than me will have some enlightening info.

I definitely will be doing more testing! There are a lot of people ordering systems right now for a variety of “machine learning” tasks. They have been waiting for Titan X and are ordering 4-card setups … and doing important work! I am very concerned!

I will try to keep this thread alive with more info. Any comments are appreciated. Thanks! --Don

You could try running NVIDIA’s NCCL benchmark:

Or try the p2pBandwidthLatencyTest that comes with the cuda samples.
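
If it isn't built yet, it lives in the samples tree (path assuming a default CUDA install under /usr/local/cuda; adjust as needed):

# build and run the P2P bandwidth/latency test from the CUDA samples
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest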

Also, I believe you have 2 PCIe switches on that X99-E WS board, not 1. I think the layout is like slide 24 shown here:

Cards on the same switch should communicate faster so that could be a factor. Perhaps experiment with different slots.
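
You can see which cards share a switch with the topology matrix (supported by recent drivers):

# print the GPU/PCIe topology matrix; PIX = same PCIe switch, PHB = traverses the CPU's host bridge
nvidia-smi topo -m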

The D8 WS is an older board and the driver is likely not well tuned for that layout. Also, the specs I find say it runs at PCIe x8 with 4 cards installed.

… the D8 board I’m referring to above is the nice one :-) Z10PE-D8 WS, full 4-way X16

Thanks for the links! The NCCL stuff looks very interesting; I’ll try the NCCL benchmark.
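
For anyone following along, building the test binaries from the repo linked above goes roughly like this (a sketch; the Makefile targets may differ between NCCL versions):

# clone NCCL and build its test binaries (they land under build/test/single)
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make CUDA_HOME=/usr/local/cuda test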

I will try p2pBandwidthLatency test right now …

(This system (X99-E WS) may have a problem on pciBusID 9; I have had “GPU has fallen off the bus” for this ID and am doing hardware debugging. That is unfortunately complicating the scaling issue, which seems to be consistent across another X99-E WS board and the Z10PE-D8 WS.)

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN X (Pascal), pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN X (Pascal), pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN X (Pascal), pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN X (Pascal), pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Cliques:
[0 1 2 3]
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 171.63 10.01 11.23 11.22
1 10.02 173.61 11.13 11.19
2 11.25 11.23 173.78 10.00
3 11.22 11.21 10.04 174.23
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 173.77 13.11 10.21 10.21
1 13.11 173.61 10.21 10.21
2 10.37 10.37 173.46 13.46
3 10.37 10.37 13.46 173.77
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 175.88 10.53 18.26 17.74
1 10.59 175.18 17.89 17.65
2 18.26 17.80 175.33 10.48
3 17.95 17.89 10.58 175.80
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 175.17 25.27 19.72 19.71
1 25.28 175.40 19.72 19.73
2 19.78 19.76 176.20 26.09
3 19.75 19.74 26.07 173.30
P2P=Disabled Latency Matrix (us)
D\D 0 1 2 3
0 2.47 15.51 15.60 15.69
1 14.90 2.49 14.84 14.98
2 15.13 15.45 2.47 15.71
3 15.13 15.27 15.51 2.44
P2P=Enabled Latency Matrix (us)
D\D 0 1 2 3
0 2.46 15.59 15.54 13.66
1 13.92 2.45 15.14 13.59
2 13.91 15.38 2.53 13.61
3 14.14 15.39 15.21 2.49

Running nccl, everything looks OK:

./build/test/single/all_reduce_test 10000000
# Using devices
#   Rank  0 uses device  0 [0x0a] TITAN X (Pascal)
#   Rank  1 uses device  1 [0x09] TITAN X (Pascal)
#   Rank  2 uses device  2 [0x06] TITAN X (Pascal)
#   Rank  3 uses device  3 [0x05] TITAN X (Pascal)

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
    10000000      10000000    char     sum    1.679   5.96   8.94    0e+00    1.693   5.91   8.86    0e+00
    10000000      10000000    char    prod    1.665   6.01   9.01    0e+00    1.675   5.97   8.95    0e+00
    10000000      10000000    char     max    1.679   5.96   8.93    0e+00    1.693   5.91   8.86    0e+00
    10000000      10000000    char     min    1.677   5.96   8.95    0e+00    1.673   5.98   8.97    0e+00
    10000000       2500000     int     sum    1.667   6.00   9.00    0e+00    1.685   5.93   8.90    0e+00
    10000000       2500000     int    prod    1.674   5.97   8.96    0e+00    1.680   5.95   8.93    0e+00
    10000000       2500000     int     max    1.673   5.98   8.96    0e+00    1.690   5.92   8.88    0e+00
    10000000       2500000     int     min    1.672   5.98   8.97    0e+00    1.685   5.94   8.90    0e+00
    10000000       5000000    half     sum    1.668   6.00   9.00    4e-03    1.673   5.98   8.97    4e-03
    10000000       5000000    half    prod    1.659   6.03   9.04    1e-03    1.677   5.96   8.95    1e-03
    10000000       5000000    half     max    1.655   6.04   9.06    0e+00    1.669   5.99   8.99    0e+00
    10000000       5000000    half     min    1.654   6.05   9.07    0e+00    1.673   5.98   8.97    0e+00
    10000000       2500000   float     sum    1.680   5.95   8.93    5e-07    1.684   5.94   8.91    5e-07
    10000000       2500000   float    prod    1.682   5.95   8.92    1e-07    1.688   5.92   8.88    1e-07
    10000000       2500000   float     max    1.666   6.00   9.00    0e+00    1.686   5.93   8.89    0e+00
    10000000       2500000   float     min    1.683   5.94   8.91    0e+00    1.697   5.89   8.84    0e+00
    10000000       1250000  double     sum    1.676   5.97   8.95    0e+00    1.679   5.95   8.93    0e+00
    10000000       1250000  double    prod    1.676   5.97   8.95    2e-16    1.687   5.93   8.89    2e-16
    10000000       1250000  double     max    1.670   5.99   8.98    0e+00    1.680   5.95   8.93    0e+00
    10000000       1250000  double     min    1.688   5.92   8.88    0e+00    1.702   5.88   8.81    0e+00
    10000000       1250000   int64     sum    1.680   5.95   8.93    0e+00    1.692   5.91   8.87    0e+00
    10000000       1250000   int64    prod    1.673   5.98   8.97    0e+00    1.682   5.95   8.92    0e+00
    10000000       1250000   int64     max    1.676   5.97   8.95    0e+00    1.685   5.93   8.90    0e+00
    10000000       1250000   int64     min    1.666   6.00   9.00    0e+00    1.678   5.96   8.94    0e+00
    10000000       1250000  uint64     sum    1.670   5.99   8.98    0e+00    1.683   5.94   8.91    0e+00
    10000000       1250000  uint64    prod    1.676   5.97   8.95    0e+00    1.680   5.95   8.93    0e+00
    10000000       1250000  uint64     max    1.678   5.96   8.94    0e+00    1.685   5.94   8.90    0e+00
    10000000       1250000  uint64     min    1.660   6.02   9.03    0e+00    1.682   5.94   8.92    0e+00

In case you’re curious about the speed of a 4-way PLX switch (instead of two 2-way switches), I get these numbers:

all_reduce_test 10000000 4 0 1 2 3
#      N    type      op     time  algbw  busbw
10000000    char     sum    1.260   7.94  11.90

Still not sure what to make of the nbody numbers. You could try running it under the profiler to see where the slowdown is.
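
For example, something like this with nvprof from the toolkit (assuming the nbody binary is in the current directory):

# profile the poorly scaling 2-GPU case and look at per-GPU kernel and memcpy timing
nvprof --print-gpu-trace ./nbody -benchmark -numbodies=256000 -numdevices=2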