Clarification on cudaMemAdviseSetReadMostly?

cudaMallocManaged allocates managed memory that is accessible from both host and device, with the CUDA runtime migrating pages and keeping the data coherent.

  1. In the docs for cudaMemAdvise:
    cudaMemAdviseSetReadMostly: This implies that the data is mostly going to be read from and only occasionally written to. Any read accesses from any processor to this region will create a read-only copy of at least the accessed pages in that processor’s memory.

If I’m reading this right, this means calling cudaMemAdvise with cudaMemAdviseSetReadMostly, then read-accessing the memory on the device, results in two copies of the data on the device. Is this correct?

  2. Again from the docs:
    Additionally, if cudaMemPrefetchAsync is called on this region, it will create a read-only copy of the data on the destination processor.

Is that the same copy as the one created by the read access? Or is it an additional (third) copy?
(I’m assuming the former, but the docs aren’t 100% clear.)
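For concreteness, the sequence being asked about might look like this (a hedged sketch; device 0 and the 2 MB size are assumptions, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 21;  // one 2 MB block (assumed size)
    float *data;
    cudaMallocManaged(&data, bytes);

    // Advise the driver that this region is read-mostly.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);

    // Per the docs, prefetching now creates a read-only copy on
    // device 0; the question is whether a subsequent device read
    // duplicates it or reuses that same copy.
    cudaMemPrefetchAsync(data, bytes, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```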

Only one copy of the data is made on the device, in any scenario.

Ok.

To follow up:

The scenario I work with is:
Many 2MB memory blocks (cudaMallocManaged). Initially populated on host, then prefetched. Rarely written to by the host. Rarely read by the host. Prefetches follow the rare writes. The device only reads them.
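Sketched in code, the lifecycle of one block might look like this (the `consume` kernel is a hypothetical stand-in for the real device-side reader; single GPU, device 0 assumed, error checking omitted):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the device only ever reads the block.
__global__ void consume(const float *block, size_t n, float *sum) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, block[i]);
}

int main() {
    const size_t n = (1 << 21) / sizeof(float);  // one 2 MB block
    float *block, *sum;
    cudaMallocManaged(&block, n * sizeof(float));
    cudaMallocManaged(&sum, sizeof(float));

    // Initially populated on the host...
    for (size_t i = 0; i < n; ++i) block[i] = 1.0f;

    // ...then prefetched to the device, which only reads it.
    cudaMemPrefetchAsync(block, n * sizeof(float), 0);
    consume<<<(n + 255) / 256, 256>>>(block, n, sum);
    cudaDeviceSynchronize();

    // Rare host write, followed by another prefetch.
    block[0] = 2.0f;
    cudaMemPrefetchAsync(block, n * sizeof(float), 0);
    cudaDeviceSynchronize();

    cudaFree(sum);
    cudaFree(block);
    return 0;
}
```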

Can mem-advising provide any benefit in this scenario?

Mem-advising may help in a scenario where both host and device are accessing the data. If, during a particular kernel execution, the host never touches the data (neither read nor write) and the device only reads it (or reads and writes it), I don’t see mem-advising helping much, although prefetching will help. There may be some corner cases, such as multi-GPU systems and other things I haven’t considered. I’m also making a number of assumptions here: that this is on Linux, on a cc 6.0 or higher GPU, and on CUDA 9.0 or 9.1.

Usually the best bet is just to benchmark your particular use-case.
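One minimal way to benchmark would be to time the read-only kernel with CUDA events, once with cudaMemAdviseSetReadMostly applied and once after cudaMemAdviseUnsetReadMostly (the `touch` kernel and sizes here are placeholders for the real workload; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for the real read-only workload.
__global__ void touch(const float *block, size_t n, float *out) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i % 32] = block[i];
}

// Time one kernel launch in milliseconds using CUDA events.
static float timeKernel(float *block, float *out, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    touch<<<(n + 255) / 256, 256>>>(block, n, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t n = (1 << 21) / sizeof(float);  // one 2 MB block
    float *block, *out;
    cudaMallocManaged(&block, n * sizeof(float));
    cudaMallocManaged(&out, 32 * sizeof(float));
    for (size_t i = 0; i < n; ++i) block[i] = 1.0f;

    // Run with the read-mostly hint in place.
    cudaMemAdvise(block, n * sizeof(float), cudaMemAdviseSetReadMostly, 0);
    cudaMemPrefetchAsync(block, n * sizeof(float), 0);
    printf("advised:   %.3f ms\n", timeKernel(block, out, n));

    // Run again with the hint removed, for comparison.
    cudaMemAdvise(block, n * sizeof(float), cudaMemAdviseUnsetReadMostly, 0);
    printf("unadvised: %.3f ms\n", timeKernel(block, out, n));

    cudaFree(out);
    cudaFree(block);
    return 0;
}
```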