cudaHostRegister/Unregister slow with recent drivers

I’ve noticed that cudaHostRegister and cudaHostUnregister have become much slower with recent driver versions (compared with 320.57 from CUDA 5.5). On my Intel Core i7-3770T (running Windows 7 64-bit) the total time for the two calls seems to be about 0.2 ms per MB. Is this something to do with the introduction of Unified Memory in CUDA 6.0? Is it going to be improved?
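
For reference, here is roughly how I’m measuring it (a minimal sketch; the 64 MB buffer size is arbitrary, and the memset is just there to make the pages resident before registering):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    const size_t bytes = 64UL * 1024 * 1024;   // 64 MB test buffer (arbitrary size)
    void* p = malloc(bytes);
    memset(p, 0, bytes);                       // touch the pages so they are resident

    cudaFree(0);                               // force context creation before timing

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaHostRegister(p, bytes, cudaHostRegisterDefault);
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaHostUnregister(p);
    auto t2 = std::chrono::high_resolution_clock::now();

    double regMs   = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double unregMs = std::chrono::duration<double, std::milli>(t2 - t1).count();
    double mb      = bytes / (1024.0 * 1024.0);
    printf("register:   %7.3f ms (%.4f ms/MB)\n", regMs, regMs / mb);
    printf("unregister: %7.3f ms (%.4f ms/MB)\n", unregMs, unregMs / mb);

    free(p);
    return 0;
}
```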

Having done some more detailed tests, it appears that it wasn’t actually that much faster with the 320.57 drivers (maybe 30% faster). What caught me out is that some of my other setup costs have also increased with the new drivers: I’m looking at an increase of about 30 ms for cudaHostRegister, plus another 70 ms either in cudaMalloc() or in initializing the device.

I also have a slightly different but related question. One way I could reduce the time spent in cudaHostRegister is by registering a smaller piece of memory and re-using it as a staging buffer. However, I run into a problem when my program is transferring data constantly (overlapped with execution). With two cards running at PCI-Express x8 3.0 I get about 6 GB/s to each, which needs 12 GB/s of host memory bandwidth. If I also have to constantly copy data from pageable to pinned memory, each copy costs a read plus a write, so that’s another 24 GB/s of memory bandwidth (36 GB/s in total). That’s more than my current platform can provide. Is there a better way to keep the PCI-Express bus saturated?
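
To make the staging pattern concrete, this is roughly what I mean (a sketch only; the chunk size and totals are illustrative). Each chunk is memcpy’d from pageable into pinned memory on the CPU and then DMA’d to the device, which is exactly where the extra 2x host bandwidth goes:

```cpp
#include <cuda_runtime.h>
#include <cstring>
#include <cstdlib>

int main()
{
    const size_t total = 1UL << 30;          // 1 GB of pageable source data
    const size_t chunk = 8UL << 20;          // 8 MB pinned staging chunks
    char* src = (char*)malloc(total);
    memset(src, 0, total);

    char* dev;
    cudaMalloc((void**)&dev, total);
    char* stage[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void**)&stage[i], chunk, cudaHostAllocDefault);
        cudaStreamCreate(&stream[i]);
    }

    // Double buffering: while one chunk is being DMA'd from pinned memory,
    // the CPU fills the other pinned buffer from pageable memory.
    for (size_t off = 0, i = 0; off < total; off += chunk, i ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaStreamSynchronize(stream[i]);        // buffer i is free to reuse
        memcpy(stage[i], src + off, n);          // pageable -> pinned (CPU copy)
        cudaMemcpyAsync(dev + off, stage[i], n,  // pinned -> device (DMA)
                        cudaMemcpyHostToDevice, stream[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) { cudaFreeHost(stage[i]); cudaStreamDestroy(stream[i]); }
    cudaFree(dev);
    free(src);
    return 0;
}
```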

If you observed reproducible, significant regressions in allocation speeds (30% strikes me as significant), I would suggest filing a bug via the bug reporting form linked from the registered developer website.

In practical terms, it is best to avoid frequent allocation and re-allocation and instead re-use existing allocations as much as possible. That advice applies regardless of platform. By the same token, minimize data copies (whether on the host, on the device, or between host and device). If the system is multi-socket, use CPU and memory affinity controls to make sure copies target the “near” memory, along the lines of the sketch below.
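
A rough illustration of the affinity point on Windows (the affinity mask and NUMA node here are placeholder values; the right ones depend on which socket the GPU is attached to on your machine):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Keep the copying thread on a CPU near the GPU (CPU 0 is illustrative).
    SetThreadAffinityMask(GetCurrentThread(), 0x1);

    // Allocate from that socket's memory node (node 0 is illustrative).
    SIZE_T bytes = 64 * 1024 * 1024;
    void* p = VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 0 /* preferred NUMA node */);
    if (!p) { printf("allocation failed: %lu\n", GetLastError()); return 1; }

    // ... register with cudaHostRegister and use as "near" staging memory ...

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```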

If your host data is too big to reside in pinned memory completely, copies to and from a pinned staging buffer will be necessary, since the GPU’s DMA engine requires page-locked host memory whose physical pages cannot be moved or swapped out. Not sure what your options are, but switching to a host platform that uses four channels of high-speed DDR3 should get you close to 50 GB/sec of theoretical system memory bandwidth (DDR3-1600 supplies 12.8 GB/sec per channel; usable bandwidth will be somewhat lower).

Thanks. I’ll probably submit a bug report once I’ve taken the time to collect some reliable measurements.

Technically, my problem isn’t that the data is too big to reside in pinned memory completely. My problem is that I only need to transfer the data once, and pinning the memory in the first place seems to take as long as actually doing the transfer. If I understand correctly, the driver has to send a list of page addresses over the PCI-Express bus to the GPUs, but the size of that list should be only a small fraction of the size of the memory being pinned.
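
A quick back-of-envelope check of that “small fraction” claim (assuming 4 KB pages and 8 bytes per physical address, both of which are assumptions on my part):

```cpp
#include <cstdio>

int main()
{
    const double bytesPinned = 1024.0 * 1024 * 1024;  // 1 GB pinned (assumed)
    const double pageSize    = 4096.0;                // 4 KB pages (assumed)
    const double entryBytes  = 8.0;                   // one 64-bit address per page
    double listBytes = bytesPinned / pageSize * entryBytes;
    // Prints: page list: 2.0 MB (0.195% of pinned size)
    printf("page list: %.1f MB (%.3f%% of pinned size)\n",
           listBytes / (1024 * 1024), 100.0 * listBytes / bytesPinned);
    return 0;
}
```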

In the meantime, at work I have a workstation with dual Xeons (each with four channels of DDR3), so I guess I’ll start using that.

Here is another interesting observation: if I allocate the memory I want to pin using VirtualAlloc with the MEM_LARGE_PAGES flag, then the cudaHostRegister call is 10x faster!
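
In case it helps anyone, this is roughly the allocation path I’m using (a sketch; the 256 MB size is arbitrary). Note that MEM_LARGE_PAGES requires the “Lock pages in memory” privilege (SeLockMemoryPrivilege) and a size that is a multiple of GetLargePageMinimum():

```cpp
#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    SIZE_T large = GetLargePageMinimum();        // typically 2 MB on x86-64
    if (large == 0) { printf("large pages not supported\n"); return 1; }

    SIZE_T bytes = 256 * 1024 * 1024;            // example size: 256 MB
    bytes = (bytes + large - 1) / large * large; // must be a large-page multiple

    void* p = VirtualAlloc(NULL, bytes,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) { printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }

    cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }

    // ... use the buffer as pinned memory for cudaMemcpyAsync etc. ...

    cudaHostUnregister(p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```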