PCI Bandwidth

Hi,

I am new to CUDA programming and my first tests are a few simple benchmarks. I have been puzzled by some benchmarks where I copy data from host to device and from device to host: it seems I only get about 3 GB/s of bandwidth.

My hardware is as follows:

  • Dell Precision 7820 tower
  • Dual Xeon Skylake 6150, 18 cores per socket, 6 DIMMs of 2666 MHz DDR4 memory per socket, for a total of 12 x 8 GB = 96 GB
  • Nvidia P1000
  • Nvidia Titan V

The P1000 is used as a graphics card with 2 screens connected, and the Titan V is used for number crunching; it is the one where I only get 3 GB/s. As I am new to PCI devices, and I am the one who plugged in the Titan V, my guess is that I did something suboptimal here. My machine has 5 PCIe slots, labeled:

  • Slot 1: PCIe3x16 (8, 4, 1)
  • Slot 2: PCIe3x16 75W <---- P1000
  • Slot 3: PCIe3x16 (1)
  • Slot 4: PCIe3x16 75W
  • Slot 5: PCIe3x16 (4, 1) <---- Titan V

The arrows show where my GPUs are currently plugged in. What should I do to get better PCIe bandwidth on the Titan V?

It seems to me your Titan V should go into slot 4, since slots 2 and 4 appear to be the only ones with a full x16 interface based on the information provided. You are currently getting about 3 GB/sec because the Titan V is in an x4 slot.

Thanks for your help.

I have moved the Titan V card, which is now in slot 4. The results now give about 5 GB/s for the bandwidth to the device and for the bandwidth from the device, but from time to time I get about 10 GB/s for one or the other.

I guess I am seeing NUMA effects because I have a dual-socket workstation. Can anyone confirm this? If so, how do I allocate memory in the RAM that is closer to the GPU?

What is the best bandwidth I should get when everything is OK on such a configuration?

The Skylake 6150 CPU is listed at Intel ARK as providing 48 PCIe lanes, which should be sufficient to drive two GPUs at full PCIe gen3 x16 speed of about 12 GB/sec, provided they are in the correct slot.
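For reference, the theoretical PCIe gen3 x16 numbers work out roughly as follows (8 GT/s per lane with 128b/130b encoding):

8 GT/s x 128/130 ≈ 0.985 GB/s per lane
0.985 GB/s x 16 lanes ≈ 15.75 GB/s theoretical peak

Packet and protocol overhead then bring the achievable throughput down to roughly 12-13 GB/s for large transfers.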

It is not clear how well controlled your experiments are. Is the performance data you quoted from the bandwidthTest app that ships with CUDA? Are you using transfers from and to pinned host memory? If not (i.e. you use pageable host memory), each transfer across PCIe also triggers an additional system-memory to system-memory transfer, which will skew results towards lower numbers.
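If it helps, here is a minimal sketch of the difference (this is not the bandwidthTest source, just an illustration of the allocation calls that matter), contrasting a pageable buffer from new with a page-locked buffer from cudaMallocHost:

#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const std::size_t n = 256 * 1024 * 1024;   // 256 MB test buffer

    unsigned char* d = nullptr;
    cudaMalloc((void**)&d, n);

    // Pageable host memory: the driver first copies it into an internal
    // pinned staging buffer, so every transfer pays an extra host-to-host copy.
    unsigned char* pageable = new unsigned char[n];
    cudaMemcpy(d, pageable, n, cudaMemcpyHostToDevice);

    // Pinned (page-locked) host memory: the GPU's DMA engine reads the
    // buffer directly, so large copies can reach full PCIe gen3 speed.
    unsigned char* pinned = nullptr;
    cudaMallocHost((void**)&pinned, n);
    cudaMemcpy(d, pinned, n, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    delete[] pageable;
    cudaFree(d);
    return 0;
}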

Also, PCIe data transport is packetized, meaning you will get better throughput for larger transfers; often it takes a transfer size of >= 8 MB to reach peak throughput.
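To see that effect, you can sweep the transfer size and time each copy with CUDA events. A rough sketch (pinned host memory, host-to-device only, error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t max_bytes = 256 * 1024 * 1024;   // largest transfer: 256 MB

    unsigned char* h = nullptr;
    unsigned char* d = nullptr;
    cudaMallocHost((void**)&h, max_bytes);   // pinned host buffer
    cudaMalloc((void**)&d, max_bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Double the transfer size from 64 KB to 256 MB; small transfers pay
    // proportionally more per-packet and per-call overhead.
    for (std::size_t bytes = 64 * 1024; bytes <= max_bytes; bytes *= 2) {
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10zu bytes: %6.2f GB/s\n", bytes, bytes / (ms * 1.0e6));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}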

Yes, NUMA issues in dual-socket systems can reduce effective throughput. You would want each GPU to talk to the “near” CPU with its attached “near” memory to maximize throughput. Use numactl to bind CPUs and memory appropriately.
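On Linux you can check which NUMA node each GPU sits on before binding anything. A small sketch using cudaDeviceGetPCIBusId and the standard sysfs numa_node entry (error checking omitted):

#include <cuda_runtime.h>
#include <cctype>
#include <cstdio>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int dev = 0; dev < ndev; ++dev) {
        char busid[32];
        cudaDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);   // e.g. "0000:B3:00.0"
        for (char* c = busid; *c; ++c) {
            *c = (char)tolower(*c);   // sysfs uses lowercase hex digits
        }

        // The kernel reports the NUMA node of every PCI device in sysfs;
        // the value is -1 if the platform does not expose the topology.
        char path[128];
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busid);

        int node = -1;
        if (FILE* f = fopen(path, "r")) {
            if (fscanf(f, "%d", &node) != 1) node = -1;
            fclose(f);
        }
        printf("GPU %d (%s) is on NUMA node %d\n", dev, busid, node);
    }
    return 0;
}

With the node known, running the benchmark as numactl --cpunodebind=N --membind=N ./my_benchmark (my_benchmark standing in for your own program) keeps both the threads and the host allocations on the socket nearest the GPU.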

  • I wrote my own program where I transfer 1 GB of memory. This is basically:

#include <cuda_runtime.h>
#include <cstddef>

const std::size_t kilo = 1024;
const std::size_t n = kilo * kilo * kilo;   // 1 GiB
unsigned char* p = new unsigned char[n];    // pageable host buffer
for (std::size_t i = 0; i < n; ++i) {
    p[i] = 0;                               // touch every page
}
// cudaMalloc returns cudaError_t and writes the device pointer
// through its first argument, so pass &q instead of assigning the result
unsigned char* q = nullptr;
cudaMalloc((void**)&q, n);
// StartTimer
cudaMemcpy(q, p, n, cudaMemcpyHostToDevice);
// StopTimer

  • I have just found the bandwidthTest program in the samples directory. It gives me 12 GB/s from host to device and 13 GB/s from device to host. So I guess everything is fine now, and I just need to learn about all that pinned memory stuff.

Thanks for your help.