Are GPU allocated pointers unique?
Perhaps a simplistic question, but when using non-unified memory, e.g.


void* devPtr = nullptr;
cudaError_t result = cudaMalloc( &devPtr, size );


Is the resulting device memory pointer unique to only the GPU, or to the entire memory system?

In other words, is it possible to ever have the same address returned by both standard malloc() and cudaMalloc()?

What about if I have 2 GPUs? Is it possible for cudaMalloc() to return the same memory address when allocating memory separately on each GPU?

#1
Posted 01/03/2018 08:48 PM   
In a UVA setup, it is not possible for two pointers (belonging to separate allocations), whether CPU or GPU or one of each, belonging to the same (CPU/OS) process, to have the same numerical value.

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED

In a non-UVA setup, all bets are off. non-UVA CUDA is going the way of the dodo.
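One practical consequence of UVA is that the runtime can tell from the numerical value of a pointer alone where the allocation lives. As a rough sketch (error handling kept minimal; the field names below match the CUDA 9-era API, where the attribute struct exposes memoryType; newer toolkits rename it to type):

```cuda
// Sketch: under UVA, cudaPointerGetAttributes() can classify an arbitrary
// pointer as host or device memory and report which device owns it,
// precisely because addresses never collide across allocators.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void* devPtr = nullptr;
    if (cudaMalloc(&devPtr, 1 << 20) != cudaSuccess) return 1;

    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, devPtr) == cudaSuccess) {
        // attr.memoryType is cudaMemoryTypeDevice for cudaMalloc()
        // allocations; attr.device identifies the owning GPU.
        printf("pointer %p is device memory on device %d\n",
               devPtr, attr.device);
    }
    cudaFree(devPtr);
    return 0;
}
```

Requires a CUDA-capable GPU to run.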

#2
Posted 01/03/2018 09:04 PM   
Thanks txbob,

I agree, it would be nice to get away from the manual memory management of old and stick with the unified approach; it is much simpler. For the time being we are stuck on a Windows platform.

Using unified memory, the kernel performance is brutal, specifically due to the initial H->D xfer. With the inability to pre-fetch data prior to executing a kernel, we are seeing unified memory transfer speeds 6x-10x slower than if we use the old style H->D xfer. The behaviour is consistent across a variety of systems and cards (M5000, M6000, GTX 1080, GTX 1080Ti). Right now I don't see a better way to solve this other than using the old style memory allocation and transfer. If you have suggestions on how to improve unified memory performance, by all means, pass them along.
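For reference, the "old style" path described above might look like the sketch below: pinned host memory plus cudaMemcpyAsync on a stream lets the H->D copy be issued before the kernel launch, which is the overlap that managed-memory prefetching would otherwise provide. The kernel name, launch geometry, and buffer size are placeholders, not anything from the thread.

```cuda
// Sketch: explicit allocation and transfer with page-locked host memory.
// The async copy and the kernel are enqueued on the same stream, so the
// copy is already in flight when the kernel launch is issued.
#include <cstring>
#include <cuda_runtime.h>

__global__ void myKernel(float* data, size_t n);  // placeholder kernel

void runWithExplicitCopy(const float* src, size_t n) {
    float* hPinned = nullptr;
    float* dBuf = nullptr;
    cudaMallocHost(&hPinned, n * sizeof(float));  // pinned (page-locked)
    cudaMalloc(&dBuf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stage data in pinned memory, then issue the H->D copy up front.
    memcpy(hPinned, src, n * sizeof(float));
    cudaMemcpyAsync(dBuf, hPinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(dBuf, n);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
}
```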

txbob said:In a UVA setup, it is not possible for two pointers (belonging to separate allocations), whether CPU or GPU or one of each, belonging to the same (CPU/OS) process, to have the same numerical value.

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED

In a non-UVA setup, all bets are off. non-UVA CUDA is going the way of the dodo.

#3
Posted 01/03/2018 09:19 PM   
Right now I don't see a better way to solve this other than using the old style memory allocation and transfer.

If you are going for maximum performance, that may well be the correct decision for quite some time to come. This is just the same effect one gets with all kinds of "convenience features" in the computer world, whether they be caching, virtual memory, branch prediction, etc.

For maximum performance, a programmer can exploit detailed knowledge of control flow and data movement patterns, while an automated mechanism can at best guess intelligently (using far less information). An automated mechanism may work well for 80% of cases, and misbehave spectacularly for a small percentage of cases. I am hoping that the introduction of deep learning techniques can give us sizeable improvements in minimizing the impact of worst-case behavior (e.g. thrashing) from automated mechanisms over the next decade.

#4
Posted 01/03/2018 09:29 PM   
You're confusing Unified Memory (UM) with Unified Virtual Addressing (UVA or UA). Please don't. Please actually read the link I provided. It has approximately nothing to do with Unified Memory.

#5
Posted 01/03/2018 09:29 PM   
Guilty as charged. You are correct I was confusing the two. For anyone else reading this, this is a good reference as well:


https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf

#6
Posted 01/04/2018 01:31 PM   