Unified memory (cudaMallocManaged) unable to oversubscribe GPU memory on sm_60, Tesla P100

My understanding is that allocating more than the 16 GB on my P100 should "just work" with cudaMallocManaged on Pascal (sm_60, CUDA 8.0). But I get the API error "out of memory" - am I missing something extra I need to do?
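For reference, the pattern I would expect to "just work" is something like this minimal sketch (the ~26 GB size and the trivial kernel are placeholders, not my actual code):

// Minimal oversubscription sketch (illustrative only): allocate ~26 GB of
// managed memory on a 16 GB P100 and touch it all from the GPU.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(double *a, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = 1.0;
}

int main()
{
    size_t n = (size_t)26 * 1024 * 1024 * 1024 / sizeof(double);  // ~26 GB of doubles, more than device memory
    double *a = NULL;
    cudaError_t err = cudaMallocManaged((void**) &a, n * sizeof(double));
    if (err != cudaSuccess) {
        printf("cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }
    touch<<<(unsigned)((n + 255) / 256), 256>>>(a, n);
    printf("kernel: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(a);
    return 0;
}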

What operating system? How much system memory?

Ubuntu 16.04, 256 GB of system memory. It failed when trying to allocate a 26 GB array.

I’ve not had any trouble with this.

I am wondering whether some OS configuration setting could be at fault, but wouldn’t know what to look for other than “everything related to memory management”. Is the installation customized or mostly using default settings?

Not customized at all, so far as I know - the clustering people who built the machine set it up, but nothing non-standard that I know of.

I’ve been suspicious of CUDA installation issues since nvprof has been giving me issues wrt seg faults (which I posted about a while back to no avail). I would try reinstalling now but I urgently need to use the device… eventually, though.

It could be a busted CUDA installation, especially if you used a package installer instead of the runfile installer method, which has not failed me once and which I would recommend. Note that package installer and runfile installer approaches don’t mix (well), so you would want to stick with either one or the other.

As for the OS, I am totally biased because I have seen so many issues with Ubuntu over the years. A Linux distro produced by people who, ah, think differently. My opinion: "Friends don't let friends use Ubuntu."

Well, the cluster folks installed it - part of the reason I want to reinstall, so I can do it myself.

Well, my friends use Ubuntu, and I’m not certain I’ll ever care that much about which Linux distro I’m using :)

As far as I can see, Ubuntu (thanks to their LTS releases) is the only non-paid-for Linux distribution that is both supported by CUDA 8.0 and still supported by its vendor.

This has forced me to abandon my favourite distro for Ubuntu on work-related machines (even though I personally would be happy to pay a distributor for the service of providing (security-) updates).

I might try out oversubscribing unified memory on CC 6.x devices under Ubuntu 16.04 at some point, but due to distro-related problems I currently have, it might take me several days until I get around to doing it.

Just re-installed CUDA using the runfile method… absolutely no difference. Can't oversubscribe, and nvprof still gives me the damn segmentation fault errors while the program itself runs fine, both on its own and under memcheck.

How fast are non-paid-for Linux distros dropping support these days??

I am highly skeptical that the current “versionitis” that seems to have gripped the software industry as a whole is doing anybody any good.

About two years maximum. Which leaves about six months after they start being supported by CUDA.

Note I’m not blaming anyone. You get what you pay for (and a lot more indeed).

I looked into this a bit further, and what's really going on is that I can't allocate more memory than the device has to a single array. I can oversubscribe as long as each array is < 16 GB. Is this expected behavior? Is there any way around this?

I would suggest you wait for a more knowledgeable person for an authoritative answer, but the limitation you describe seems plausible to me. Since managed memory needs to be mapped into both the CPU and the GPU address spaces, and the GPU only has 16 GB to map, a single allocation cannot exceed the physical size of the GPU memory (probably a little less, as CUDA itself needs some memory).

Initially, managed memory moved an entire array to the CPU or GPU as required, which automatically limits an allocation to VRAM size. I don't remember whether they already implemented paging on the P100 (and which driver it requires), or whether we need to wait for Volta?

The Programming Guide section on GPU Memory Oversubscription doesn't mention such a limit on a single array, and indeed claims that CC 6.x devices have a full 49-bit virtual address space.

Can you post the exact code you use to allocate the array? As txbob has already tested that allocating a 26 GB array works for him, I suspect that integer overflow is the issue here, rather than device memory size.

The allocation line is

cudaMallocManaged((void**) &array, 2 * 512 * 512 * 512 * 13 * sizeof(double));

which amounts to 2 × 512^3 × 13 = 3,489,660,928 doubles = 27,917,287,424 bytes.

I would be very happy to not have to divide these arrays by hand - it wouldn’t be the worst thing ever (that factor of 2 can easily be split), but I would like to avoid differentiating the CPU and GPU implementations of my code any further.

Integer overflow, as tera suspected. ‘int’ can only represent values up to 2**31-1 = 2,147,483,647. Try this instead:

sizeof(double) * 2 * 512 * 512 * 512 * 13

which will cause the expression to be evaluated using size_t.
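As a minimal, self-contained sketch (the variable names are just for illustration), the corrected allocation with an error check would look like:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    double *array = NULL;
    // sizeof(double) comes first, so the whole product is evaluated as size_t;
    // an explicit (size_t)2 as the first factor would work just as well.
    size_t bytes = sizeof(double) * 2 * 512 * 512 * 512 * 13;   // 27,917,287,424
    cudaError_t err = cudaMallocManaged((void**) &array, bytes);
    printf("requested %zu bytes: %s\n", bytes, cudaGetErrorString(err));
    cudaFree(array);
    return 0;
}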

Goodness, here’s something I’ll never mess up again. That’s exactly it, thank you.

While we're here, if anyone has any experience - are there any "best practices" for dealing with problem sizes that exceed GPU RAM? (In my case, by a factor of at most ~4 at the end of the day.) I would assume that letting CUDA take care of page faulting is optimal.

I’m hoping that enough computation can fit onto the device at once to mask the host-device transfers, since I know this is the most costly aspect of GPU programming.

Using managed memory and relying on page faulting certainly isn’t “optimal” - manually moving the right data to the device ahead of it being accessed will always be faster.
But using prefetch (cudaMemPrefetchAsync) and usage hints (cudaMemAdvise) you may get very close to optimal performance with a lot less effort - see the sketch below.
Unfortunately CUDA, unlike CPU operating systems, doesn’t offer a memAdvise hint for sequential access yet. I hope this will come with a future CUDA release.
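As a rough sketch of the chunked prefetch pattern I have in mind (the chunk size, kernel, and loop structure are made up for illustration; real overlap would need double buffering across two streams, and cudaMemAdvise hints could be added depending on your access pattern):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(double *a, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0;   // stand-in for the real work
}

int main()
{
    const size_t total = (size_t)3 * 1024 * 1024 * 1024;   // 3G doubles, ~24 GB managed allocation
    const size_t chunk = (size_t)256 * 1024 * 1024;        // 256M doubles, 2 GB per chunk

    int dev = 0;
    cudaGetDevice(&dev);

    double *a = NULL;
    cudaMallocManaged((void**) &a, total * sizeof(double));
    for (size_t i = 0; i < total; ++i) a[i] = 1.0;          // pages start out resident on the host

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        // move the chunk to the GPU ahead of the kernel instead of paying for page faults inside it
        cudaMemPrefetchAsync(a + off, n * sizeof(double), dev, stream);
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(a + off, n);
        // evict the finished chunk back to the host so the next one has room
        cudaMemPrefetchAsync(a + off, n * sizeof(double), cudaCpuDeviceId, stream);
    }
    printf("done: %s\n", cudaGetErrorString(cudaStreamSynchronize(stream)));

    cudaStreamDestroy(stream);
    cudaFree(a);
    return 0;
}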