First impressions of CUDA 6 managed memory

I’ve started experimenting with CUDA 6 managed memory over the past few days, and wanted to share a few notes for those who are curious about this new feature. (My tests have been with the CUDA 6 release candidate, and GK208 and GM107 cards in a PCI Express 2.0 motherboard running Ubuntu 13.04. To streamline testing, I’ve been using my CUDA 6 fork of PyCUDA, so YMMV, but hopefully not by much.)

First some qualitative observations (some of this can be deduced from the manual but not everyone can read that yet):

  • You should think of managed memory as device memory that you can access from the host, rather than host memory you can access on a device. (The driver API explicitly treats it as device memory.) As a result, managed memory is bound to the GPU that is active when you create it, and data does not migrate between GPUs.
  • When you allocate managed memory, the driver allocates memory on both the host and the device at the same time. If you check the device's free memory immediately after cudaMallocManaged returns, you will see it reduced by the amount you allocated (see the sketch after this list).
  • As a result of the previous point, managed memory cannot be used to access arrays in host memory that do not also fit in device memory. I had briefly imagined that managed memory would allow me to treat host memory as a page buffer backing arrays larger than the device memory, but that doesn't work.
  • They are not kidding about the concurrency limitations. Fail to synchronize with the appropriate streams before accessing managed memory on the host, and *BOOM*: segfault, just as if you had dereferenced a bogus address.

    This makes CUDA bindings for managed memory a bit awkward in languages with a runtime that tries to protect you from memory access violations (like Python). I currently don’t have any good strategy to prevent the Python interpreter from segfaulting completely if host access happens at an inappropriate time.
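
Here is a minimal, standalone sketch of those two points (the free-memory drop at allocation time and the mandatory synchronization before host access), written against the plain CUDA runtime API rather than my PyCUDA fork; the kernel, sizes, and names are just made up for illustration:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel that touches the managed buffer on the device.
    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;                 // placeholder size
        size_t free_before, free_after, total;

        cudaMemGetInfo(&free_before, &total);

        float *data;
        cudaMallocManaged((void **)&data, n * sizeof(float), cudaMemAttachGlobal);

        cudaMemGetInfo(&free_after, &total);
        // Per the observation above, free_after already reflects the
        // device-side allocation, before any kernel has run.
        printf("device free memory: %zu -> %zu bytes\n", free_before, free_after);

        // Host access is fine here: no kernel has been launched yet.
        for (int i = 0; i < n; i++)
            data[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(data, n);

        // Without this synchronize, reading data[] on the host below
        // segfaults, exactly as described above.
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);

        cudaFree(data);
        return 0;
    }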

And some more quantitative observations from microbenchmarking with strided, linear access patterns on a 400+ MB contiguous array:

  • Managed memory transfer from host to device (presumably triggered by a page fault on the device) appears to copy the entire array, regardless of how much of the array is accessed on the device. The time spent transferring the data is about 30% longer than doing an explicit cudaMemcpy on the Kepler GK208 and 70% longer on the Maxwell GM107.
  • Managed memory transfer from device to host (initiated by a page fault on the host) behaves very differently. It takes 6x longer on the GK208 and 8x longer on the GM107 than an explicit cudaMemcpy. However, unlike the host-to-device direction, the transfers appear to have 4096-byte granularity (the VM page size on Linux). This means that implicit transfers can beat a full cudaMemcpy by a wide margin if you only sparsely access a managed data structure on the host after a kernel processes it (roughly the access pattern in the sketch after this list).
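
For the curious, this is roughly the shape of the device-to-host comparison, again as a plain CUDA C sketch rather than the actual PyCUDA benchmark; the sizes, the stride, and the gettimeofday-based timing are placeholders, not the exact setup I measured:

    #include <cstdio>
    #include <cstdlib>
    #include <sys/time.h>
    #include <cuda_runtime.h>

    // Wall-clock helper (milliseconds).
    static double wall_ms(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
    }

    __global__ void fill(float *data, size_t n)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = (float)i;
    }

    int main(void)
    {
        const size_t n = 100 * 1000 * 1000;          // ~400 MB of floats (placeholder)
        const size_t page = 4096 / sizeof(float);    // elements per 4096-byte VM page

        float *managed, *plain, *host;
        cudaMallocManaged((void **)&managed, n * sizeof(float), cudaMemAttachGlobal);
        cudaMalloc((void **)&plain, n * sizeof(float));
        host = (float *)malloc(n * sizeof(float));

        fill<<<(unsigned int)((n + 255) / 256), 256>>>(managed, n);
        fill<<<(unsigned int)((n + 255) / 256), 256>>>(plain, n);
        cudaDeviceSynchronize();

        // Baseline: explicit full-array copy back from the ordinary allocation.
        double t0 = wall_ms();
        cudaMemcpy(host, plain, n * sizeof(float), cudaMemcpyDeviceToHost);
        double t1 = wall_ms();

        // Sparse host reads of the managed array: only the pages actually
        // touched appear to migrate back, one 4096-byte page at a time.
        double sum = 0.0;
        double t2 = wall_ms();
        for (size_t i = 0; i < n; i += page * 16)    // visit 1 page in 16
            sum += managed[i];
        double t3 = wall_ms();

        printf("explicit cudaMemcpy D2H: %.1f ms, sparse managed reads: %.1f ms (sum=%g)\n",
               t1 - t0, t3 - t2, sum);

        cudaFree(managed);
        cudaFree(plain);
        free(host);
        return 0;
    }

If the 4096-byte granularity observation holds, the sparse-read time should scale with the number of pages you touch rather than with the size of the array.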

Based on the above, it sounds like both Kepler and this first release of Maxwell hardware lack the virtual memory capability on the device required to transfer individual pages of memory from the host on demand. The host CPU can (and does) transfer individual pages from the device as they are needed, but the ability comes at a pretty high bandwidth efficiency cost. Interestingly, GM107 seems to actually have lower managed memory performance than Kepler at the moment.

Has anyone else had time to play with managed memory?

One followup:

If two GPUs are present in the system and they are not capable of accessing each other’s memory, managed memory allocations behave just like page-locked, zero-copy host memory. You have to use the CUDA_VISIBLE_DEVICES environment variable to limit your program to one CUDA device in order to avoid this.
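
If you want to check up front whether a given system will hit this fallback, a quick sketch against the runtime API reports whether each pair of visible devices can access the other’s memory:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        // If any pair of devices cannot access each other's memory, managed
        // allocations fall back to zero-copy host memory, as described above.
        for (int a = 0; a < count; a++)
            for (int b = 0; b < count; b++) {
                if (a == b)
                    continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, a, b);
                printf("device %d -> device %d: peer access %s\n",
                       a, b, can ? "possible" : "NOT possible");
            }
        return 0;
    }

Restricting a run to a single card from the shell is then just, e.g., CUDA_VISIBLE_DEVICES=0 ./benchmark (the program name here is a placeholder).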

This fallback is explained in the CUDA C Programming Guide, but it still confused me when I first ran the GK208 and GM107 in the same system and all my benchmark numbers changed.