GTX 1080 Ti performance is poor

Hi,
After installing CUDA on a CentOS 7.2 server with a GTX 1080 Ti GPU, performance is very poor: significantly slower than the GTX 1080 GPUs on another server.
The bandwidth test shows the following on the 1080 Ti:

sudo /usr/local/cuda-8.0/samples/1_Utilities/bandwidthTest/bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: Graphics Device
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3906.3

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2650.9

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 343047.4

Result = PASS

The GTX 1080 on another server shows the following:

sudo ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce GTX 1080
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12439.3

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12879.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 231598.8

Result = PASS

The device-to-host and host-to-device values are significantly lower for the Ti card.
Any idea what the reason might be?

These numbers are affected by the type of slot the card is plugged into (e.g. x8 or x16 width; gen1, gen2, or gen3).

You can get an idea of what type of slot the card thinks it is plugged into by running:

nvidia-smi -a

and looking at the sub-section “GPU Link Info”

Your 1080 GPU shows what I would expect for a x16 gen3 slot.
Your 1080 Ti seems to be in something like a x8 gen2 slot or a x16 gen1 slot.
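
As a rough sanity check on the slot estimates above, the theoretical one-direction ceiling of a PCIe link can be computed from its generation and lane count. The sketch below is illustrative only: the function name pcie_peak_mb_s is made up for this example, the per-lane rates and encodings are the published PCIe 1.x/2.x/3.x figures, and measured bandwidthTest numbers normally land below these ceilings because of packet and protocol overhead.

```python
# Raw transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
# Gen1 and gen2 use 8b/10b encoding; gen3 uses 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def pcie_peak_mb_s(generation, lanes):
    """Theoretical one-direction ceiling in MB/s (1 MB = 10**6 bytes)."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    bits_per_second = rate_gt_s * 1e9 * efficiency * lanes
    return bits_per_second / 8 / 1e6


# The three slot configurations mentioned in this thread.
for gen, lanes in [(3, 16), (2, 8), (1, 16)]:
    print(f"gen{gen} x{lanes}: {pcie_peak_mb_s(gen, lanes):.0f} MB/s")
```

Under these assumptions gen3 x16 tops out at roughly 15754 MB/s, while gen2 x8 and gen1 x16 both top out at 4000 MB/s, which is the range the 1080 Ti's 3906.3 MB/s falls into.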

Hi Txbob,
Thanks for your quick response. nvidia-smi -a shows a gen3 x16 PCIe slot:

GPU Link Info
    PCIe Generation
        Max: 3
        Current: 3
    Link Width
        Max: 16x
        Current: 16x

Performance State: P5
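
The same link fields can also be read in a script-friendly form. This is a minimal sketch, assuming the installed nvidia-smi supports the --query-gpu fields pcie.link.gen.current, pcie.link.gen.max, pcie.link.width.current, pcie.link.width.max and pstate (nvidia-smi --help-query-gpu lists what is available). The helper names are hypothetical, and the sample line simply mirrors the values posted above rather than real captured output.

```python
import subprocess

# Query fields, in the order they are requested from nvidia-smi.
FIELDS = [
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
    "pstate",
]


def parse_link_line(line):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def link_is_degraded(info):
    """True if the link currently runs below its maximum generation or width."""
    gen_low = int(info["pcie.link.gen.current"]) < int(info["pcie.link.gen.max"])
    width_low = int(info["pcie.link.width.current"]) < int(info["pcie.link.width.max"])
    return gen_low or width_low


def query_link_info():
    """Run nvidia-smi and return one dict per GPU (needs the NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        text=True,
    )
    return [parse_link_line(line) for line in out.splitlines() if line.strip()]


# Hypothetical sample line matching the values reported above.
sample = parse_link_line("3, 3, 16, 16, P5")
print(sample["pstate"], link_is_degraded(sample))
```

On a machine with the driver installed, query_link_info() would be called instead of parsing the sample line. The current link generation can drop while the GPU idles in a low-power state, so it may be worth running the query while bandwidthTest is executing.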

I’m seeing fairly similar results. I’m using Windows 10 with the 387.92 driver. I’ve confirmed that the GPU is on PCIe Gen 3 x16.

For me, the host-to-device bandwidth is as expected, but device-to-host is significantly slower.

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12808.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5654.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     363164.3

The pinned-memory device-to-host bandwidth is less than half of what it should be, and it is also slower than the device-to-host bandwidth measured with pageable memory.