GPU swapping

Is there any swapping mechanism for the GPU, or will the process definitely be killed upon reaching the memory limit? If there isn't, can a user enable such a feature? And if it is enabled, can a non-root user disable it?

A Pascal or Volta GPU running on a Linux OS can have its memory "oversubscribed". In that case, the GPU runtime will swap pages of memory as needed between host and device. To take advantage of this, the memory must be allocated with a managed allocator, such as cudaMallocManaged.

Programming Guide :: CUDA Toolkit Documentation
Programming Guide :: CUDA Toolkit Documentation
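
To make that concrete, here is a minimal sketch (my own illustration, not taken from the documentation) that allocates about 1 GiB more managed memory than the device physically has and touches it from a kernel. On a Pascal or Volta GPU under Linux, the allocation succeeds and pages are migrated on demand rather than the allocation failing:

#include <cstdio>
#include <cuda_runtime.h>

// Touch one byte per 4 KiB page; each first touch from the GPU can fault
// and migrate that page to the device.
__global__ void touch(char *p, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i * 4096 < n; i += stride)
        p[i * 4096] = 1;
}

int main()
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);

    size_t n = totalB + (1ull << 30);   // roughly 1 GiB more than the device's physical memory
    char *p = nullptr;
    cudaError_t err = cudaMallocManaged(&p, n);
    if (err != cudaSuccess) { printf("alloc failed: %s\n", cudaGetErrorString(err)); return 1; }

    touch<<<256, 256>>>(p, n);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(p);
    return 0;
}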

Does that mean that for Maxwell and older GPUs, the process will simply be killed?

Moreover, what about binaries running on the GPU? Assume we have only an executable. How can we find out whether swapping will be used on Pascal and Volta?

The process won't be killed (at least, not by the CUDA runtime). GPU memory is allocated using a function like cudaMalloc.

If you request more than what is available, cudaMalloc will return an error. Beyond that, the application/process behavior is a function of what that application does with that error.
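
As a sketch of what that looks like (my example, not from the documentation), the application sees the failure as a return code from cudaMalloc and decides for itself what to do next:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    void *p = nullptr;
    size_t request = 1ull << 40;                        // deliberately absurd request: 1 TiB
    cudaError_t err = cudaMalloc(&p, request);
    if (err != cudaSuccess) {
        // typically cudaErrorMemoryAllocation ("out of memory")
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;                                       // the runtime did not kill us; we chose to exit
    }
    cudaFree(p);
    return 0;
}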

You can profile an application to determine whether or not swapping occurred during execution. Read the profiler manual.

For the compute 5.x counters described in the nvprof manual, I don't see anything about swapping. Maybe the name is not exactly "swap". Do you know what it is called?

Compute capability 5.x doesn't support demand paging.

I mentioned in my first comment that it had to be a Pascal or Volta GPU (6.x or 7.x).

Maxwell (5.x) does not support swapping/paging.

Both nvprof and the Visual Profiler can display data about page faults.

Please read the profiler manual, paying attention to the page-fault information.

I don’t see any metrics for this, but I didn’t look carefully.

On an M2000 (5.x), the program uses less than 4 GB of memory. When I run it with nvprof, I notice that the memory usage increases, which means the events I selected use memory. That is fine, though.
As a test, I enabled a large number of events. The program hasn't been killed yet.

Prior to that, and without nvprof, I increased the problem size, and I am sure the memory usage should be more than 4 GB. However, the program wasn't killed either, and it made no progress. More precisely, the progress was really slow.

So I think the M2000 uses swapping, paging, or something else in order not to kill the program.

You said (and the manual says [1]) that Maxwell has no events related to paging or swapping. How can that be justified?

[1] https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-5x

It’s OK if you disagree with me or don’t believe me.

I can't explain the behavior of a program you haven't shown. If you want to claim that the M2000 employs paging, you're welcome to believe that. I wouldn't claim that.

I don't know what you mean by "How can that be justified?"

Are you asking “Why are there no metrics related to demand-paging?”

If so, I don’t know the reason why on cc6.x or cc7.x. But on cc5.x I wouldn’t expect to see any metrics related to demand paging, because a cc5.x device does not support demand paging. Of course if you disagree, you’re welcome to your opinion, but I wouldn’t be able to respond to anything based on that.

From the top of my memory, Maxwell becomes much slower when almost the entire memory is used and you access memory pages in a random fashion; the reason is probably the limited size of the TLB cache. So if you are wrong and your program is actually using a little less than 4 GB of memory, that may be the reason for the slowdown.

I am not talking about personal points of view. Let me state the problem in another way. Forget about nvprof…

I have a GPU binary that runs on an M2000. When the input size is small, the program runs fine. However, when the input size is large, the memory usage reported by nvidia-smi is at its maximum value. The screen sometimes becomes unstable, and window refreshing is also slow at times.

The input size I gave the program should require more than 4 GB. If there is no swapping mechanism, the program should be killed. But it is alive! So I interpret that to mean there is a swap/paging solution on the M2000. I am not aware of the internals of Maxwell; this is just what I see.

Any comment?

@BulatZiganshin
I agree with you. That means a run should be set up carefully in order to avoid the slowdown.

Try making a very simple program that allocates a little more than 4 GiB, fills the entire array with data, and exits. Maybe you didn't take into account the difference between GB and GiB? :)
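
Something along those lines (my own sketch of that test, not code posted in this thread) might look like this; on a 4 GB (not GiB) card such as the M2000, the allocation itself should already fail:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t n = (1ull << 32) + (256ull << 20);   // 4 GiB + 256 MiB
    char *p = nullptr;

    cudaError_t err = cudaMalloc(&p, n);
    printf("cudaMalloc(%zu bytes): %s\n", n, cudaGetErrorString(err));
    if (err != cudaSuccess) return 1;

    err = cudaMemset(p, 0xAB, n);               // fill the entire array with data
    printf("cudaMemset: %s\n", cudaGetErrorString(err));

    cudaFree(p);
    return 0;
}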

For GPU oversubscription, can we limit the amount of host memory used for swapping?

I assume by “for swapping” you mean migration of data between host and device.

You cannot limit the host memory usage directly. The way to limit it is to reduce your allocation size.

If you allocate 100 GB of managed data, that can potentially use 100 GB of host memory.

If by “for swapping” you meant swapping from host memory to disk, that has nothing to do with managed memory or this discussion.

I did indeed mean migration of data between host and device. Thanks for your reply.

If I understand it correctly, a GPU with oversubscribed memory will use the host’s DRAM as “swap” space, correct?

What happens if the host process that is launching GPU kernels is pinned to a specific NUMA domain? Does the GPU attempt to swap to that NUMA domain exclusively, or can it use any of the host's NUMA domains?

How can we tell that a GPU is “swapping” to the host’s DRAM?

Can we get a notification when the GPU starts swapping into the host’s memory?

Thanks!
Michael

That might be one way of saying it, although I think it might be leading you down a confusing alley to think of it that way. I would describe it as an allocation that can live in either host or device memory, and the pages of that allocation will move to the processor (host or device) that needs to access them. At some point, this might cause “swapping” but that is certainly not a complete description of the behavior. Pages migrate to the processor that demands them. And yes, this might result in swapping. But data can move to the host when the host demands it, even if “eviction” or “swapping” would be otherwise unnecessary.
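
A small sketch of that access pattern (my own example, assuming a cc 6.x or later GPU on Linux): the same managed allocation is touched by the host, then by the GPU, then by the host again, and pages migrate to whichever processor demands them at each step:

#include <cstring>
#include <cuda_runtime.h>

__global__ void increment(int *a, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;              // GPU demand: pages fault/migrate to the device
}

int main()
{
    const size_t n = 1 << 24;          // 16M ints, about 64 MB
    int *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(int));

    memset(a, 0, n * sizeof(int));     // host demand: pages become resident on the host

    increment<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();           // pages touched by the kernel are now on the device

    int check = a[0];                  // host demand again: touched pages migrate back
    (void)check;
    cudaFree(a);
    return 0;
}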

Think of it as a memory allocation. The behavior will be the same as any other memory allocation done by that process.

The profiler can tell you when page migration is occurring.

Thank you Robert!

Referring to the A100 and V100, I understand that the code has to use cudaMallocManaged(), which allocates a "unified" memory view from the virtual address space point of view.

a) Does this VM space occupy physical space in the host's DRAM all the time? I understand that there is a mechanism that allows memory reads/writes to unified memory from both sides, but do these memory locations also occupy physical memory on the host side?

b) Does oversubscription of the GPU memory trigger the GPU to start moving pages out to the host's DRAM?

c) If the host process that launches GPU kernels is restricted to allocating memory from a particular NUMA domain and experiences GPU memory oversubscription, would the GPU be free to use any memory domain to hold the excess data, or can it only write out to the NUMA domain the host process is restricted to? I am trying to see whether, in this oversubscription case, the GPU can allocate physical memory from any NUMA domain or whether it is restricted to the same domain as the host process.

Thanks!

No, not necessarily. A modern memory system in a modern OS like Windows or Linux uses demand-paged virtual memory, and this leverages that. When a logical page needs to be "brought into existence" (i.e., take up space in the host's DRAM), the host OS demand-paging process handles that.

A confusing question to me. Oversubscription simply means that the allocation request exceeds the physical DRAM size. The GPU will move a page out to host DRAM ("swapping") if the current page fault on the GPU side requests a page that is not currently present, and the page table entries are "full", i.e., all physical DRAM on the GPU currently has logical pages instantiated. In that case, one of the logical pages will be removed from the paging table (so a future request to it would cause a page fault) and the contents of that page will be put "somewhere else". That somewhere else is probably host DRAM.

The answer to this isn't any different from how the host OS handles an allocation of that size. This isn't specified or controlled by CUDA. Let's say we are talking about a 100 GB allocation. Let's say you did int *a = (int *)malloc(100000000000); in host code. Where will the pages corresponding to that allocation live? The answer to that is also the answer to where the GPU managed memory system will put allocated pages of data when you do int *a; cudaMallocManaged(&a, 100000000000);

Hello,

It looks like cudaMallocManaged can't use the swap space on disk. The process just gets killed by the oom-killer when it runs out of GPU memory and DRAM. libc malloc can swap data out to disk when it runs out of RAM, thanks to virtual memory.

So is there a way to make use of GPU memory + DRAM + swap space on disk?

There isn’t any way to use the swap space on disk for anything that would be directly accessible from GPU device code. Host pageable allocations may have some connection to swap usage, but host pageable allocations are not accessible from GPU device code.