How to transfer massive data efficiently?

Hello, forum!

I’m studying GPUs for my MSc degree and I need to find a way to transfer a large amount of data (9 GB, I suppose) from host memory to GPU memory as quickly as possible. I think the PCIe bus bandwidth and the application’s heap memory are the bottlenecks. Is that correct? My current implementation is described below:

  • My database is an integer array stored on the hard drive;
  • To copy this database to the GPU, I copy as many bytes as fit in the application’s heap memory, and I repeat this step until the whole array has been copied to GPU memory (see the sketch after this list);
  • I’m not using any parallelism to copy data to the GPU;
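
Roughly, the copy loop looks like this at the moment (a simplified sketch; the file name, chunk size, and processing step are placeholders, not my real code):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t chunkBytes = 256 * 1024 * 1024;   // as much as fits in my heap buffer (placeholder)
    int *hostBuf = (int *)malloc(chunkBytes);      // pageable host buffer
    int *devBuf;
    cudaMalloc(&devBuf, chunkBytes);

    FILE *f = fopen("database.bin", "rb");         // integer array on the hard drive (placeholder name)
    size_t bytesRead;
    while ((bytesRead = fread(hostBuf, 1, chunkBytes, f)) > 0) {
        // blocking copy: nothing else happens while this runs
        cudaMemcpy(devBuf, hostBuf, bytesRead, cudaMemcpyHostToDevice);
        // ... kernel that consumes this chunk would go here ...
    }

    fclose(f);
    cudaFree(devBuf);
    free(hostBuf);
    return 0;
}
```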

Considering my case and the usual trade-offs, are there any optimizations I should implement to get the best bandwidth when transferring data to GPU memory? Does pinned memory bring a performance increase with this amount of data? Can I use parallelism to copy more than one chunk of the array to GPU memory at a time? Is it possible to increase the heap size with nvcc? Has anybody had any of these problems?

Thanks, all.

First off, 9 GB of data will not fit into the on-board memory of most GPU cards. Note that on cards with dual GPUs (such as the Tesla K80), each GPU is coupled to its own memory; the memory on the card is not a single unified pool shared by both GPUs.

From a performance perspective, it will very likely be better to partition your data: transfer it to the GPU (and back) in chunks sized so that kernel execution time and copy time are balanced, giving the best possible overlap. You will need to use CUDA streams and asynchronous copies. This allows you to build a pipeline in which CPU processing, GPU processing, and host<->device copies occur concurrently.
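
For example, a double-buffered pipeline along these lines (a minimal sketch; the chunk size, buffer count, and kernel are placeholders you would tune for your workload):

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever processes each chunk.
__global__ void process(int *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const size_t chunkElems  = 8 * 1024 * 1024;   // 32 MB of int per chunk (assumption)
    const int    numBuffers  = 2;                 // double buffering
    const int    totalChunks = 16;                // placeholder; ~9 GB / 32 MB in practice

    int *hostBuf[numBuffers];
    int *devBuf[numBuffers];
    cudaStream_t stream[numBuffers];
    for (int i = 0; i < numBuffers; i++) {
        cudaMallocHost(&hostBuf[i], chunkElems * sizeof(int));  // pinned host staging buffer
        cudaMalloc(&devBuf[i], chunkElems * sizeof(int));
        cudaStreamCreate(&stream[i]);
    }

    for (int c = 0; c < totalChunks; c++) {
        int i = c % numBuffers;
        cudaStreamSynchronize(stream[i]);   // wait until this buffer's previous work has drained
        // ... read the next chunk from disk into hostBuf[i] here (CPU work overlaps with GPU work) ...
        cudaMemcpyAsync(devBuf[i], hostBuf[i], chunkElems * sizeof(int),
                        cudaMemcpyHostToDevice, stream[i]);
        process<<<(unsigned)((chunkElems + 255) / 256), 256, 0, stream[i]>>>(devBuf[i], chunkElems);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < numBuffers; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFree(devBuf[i]);
        cudaFreeHost(hostBuf[i]);
    }
    return 0;
}
```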

Copies from pinned memory are faster than copies from pageable memory. DMA transfers to and from the GPU need to be performed from contiguous physical memory, so the use of pageable memory introduces an additional copy on the host side, from user space to a driver-maintained pinned DMA buffer. PCIe transfers incur a certain amount of fixed overhead, so you would want to transfer data in blocks of 32 MB or larger to maintain the best possible PCIe throughput.
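
If you want to measure the difference on your own system, a quick comparison along these lines works (a sketch; the 64 MB buffer size is just an assumption):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time a single host-to-device copy with CUDA events, in milliseconds.
static float timeH2D(void *dst, void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 64 * 1024 * 1024;   // large enough to hide the fixed per-transfer overhead
    void *dev;
    void *pageable = malloc(bytes);
    void *pinned;
    cudaMalloc(&dev, bytes);
    cudaMallocHost(&pinned, bytes);          // pinned (page-locked) allocation

    printf("pageable: %.2f ms\n", timeH2D(dev, pageable, bytes));
    printf("pinned  : %.2f ms\n", timeH2D(dev, pinned,   bytes));

    cudaFreeHost(pinned);
    cudaFree(dev);
    free(pageable);
    return 0;
}
```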

With a GPU on a x16 PCIe gen 3 interface, you can transfer data at about 12 GB/sec for these large blocks. Since PCIe is a full-duplex interconnect, you can transfer simultaneously at that speed in both directions if you have a GPU with dual copy engines (dual DMA engines). Note that your host system memory could become a limiting factor in a scenario involving simultaneous up/down transfers, as the usable bandwidth of two-channel DDR3 configurations is about 25 GB/sec. If you have a server-class host system with four DDR3 channels, that will eliminate such a bottleneck. DDR4 system memory will give you about another 20% of additional bandwidth. If the host is a multi-socket system, make sure to use CPU and memory affinity settings so the GPU always “talks” to the “near” CPU; otherwise latency and bandwidth will suffer due to the CPU-to-CPU interconnect.
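
Simultaneous transfers in both directions can be issued from two streams, for example (a sketch; buffer sizes are placeholders, and both directions only actually overlap on a device with dual copy engines):

```
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 * 1024 * 1024;   // placeholder transfer size
    char *hUp, *hDown, *dUp, *dDown;
    cudaMallocHost(&hUp, bytes);              // pinned buffers are required for async copies
    cudaMallocHost(&hDown, bytes);
    cudaMalloc(&dUp, bytes);
    cudaMalloc(&dDown, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // With dual copy engines, these two transfers can run at the same time,
    // one in each direction over the full-duplex PCIe link.
    cudaMemcpyAsync(dUp, hUp, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(hDown, dDown, bytes, cudaMemcpyDeviceToHost, down);
    cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFreeHost(hUp);  cudaFreeHost(hDown);
    cudaFree(dUp);      cudaFree(dDown);
    return 0;
}
```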

Pinned memory has a time cost: the cost of pinning the memory in the first place. As a rule of thumb, this cost is approximately 50% of the time it takes to do one transfer from that (unpinned) memory, and pinning gives approximately a 2x speed boost to transfers.

Therefore, pinning a buffer and using it for only one transfer is rarely that beneficial in terms of overall application execution time (the transfer itself will run twice as fast, but this benefit is offset by the time it takes to pin the buffer). You may still want to pin the memory anyway, even if only using it for one transfer, as pinned memory is required for overlap of copy and compute operations.

However, as njuffa states, a data set of this size is begging to be broken into chunks so that you can overlap copy and compute. In that case, it’s likely that you will re-use buffers for data transfers across multiple chunks, and pinning memory is then likely to give a net benefit to overall application execution time.

I note that in some of your other postings elsewhere you have expressed an interest in using thrust. With thrust v1.8 (which ships with CUDA 7), it’s now possible to build applications that overlap copy operations with thrust compute operations. Here are two fully worked examples:

c++ - How to asynchronously copy memory from the host to the device using thrust and CUDA streams - Stack Overflow

Getting CUDA Thrust to use a CUDA stream of your choice - Stack Overflow
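
The core pattern from those examples looks roughly like this (a sketch based on the thrust::cuda::par.on() execution policy introduced in thrust 1.8; the functor and vector size are placeholders):

```
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

// Placeholder functor standing in for the real per-element work.
struct scale {
    __host__ __device__ int operator()(int x) const { return 2 * x; }
};

int main()
{
    const size_t n = 1 << 20;
    thrust::device_vector<int> d(n, 1);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // thrust 1.8: issue the algorithm into a stream of your choice,
    // so it can overlap with async copies issued into other streams.
    thrust::transform(thrust::cuda::par.on(stream),
                      d.begin(), d.end(), d.begin(), scale());

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```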

Thanks, txbob and njuffa!

txbob, my experience using thrust was only for benchmarking.

About the heap memory: can I increase its size, and would that bring me any advantage?

Something I noticed was that if you can enable large memory pages then you can dramatically reduce the cost of pinning memory.

True. I think this is currently fairly atypical for a vanilla Linux install, but it is correct. The pinning cost scales with the number of pages to be pinned (each must be pinned individually). Therefore the same memory size composed of fewer pages will have a lower pinning cost.
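
On Linux, one way to experiment with this is to back the buffer with huge pages and then register it with the CUDA runtime (a sketch; it assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, and the 1 GB buffer size is a placeholder):

```
#include <sys/mman.h>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1024UL * 1024 * 1024;   // 1 GB buffer (placeholder size)

    // Anonymous mapping backed by huge pages (2 MB pages on typical x86_64),
    // so far fewer pages need to be pinned than with 4 KB pages.
    void *buf = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");   // usually means no huge pages are reserved
        return 1;
    }

    // Pin (page-lock) the buffer so it can be used for fast DMA transfers.
    cudaError_t err = cudaHostRegister(buf, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... use buf as the source/destination of cudaMemcpyAsync here ...

    cudaHostUnregister(buf);
    munmap(buf, bytes);
    return 0;
}
```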