Is there a limit to cudaMallocHost memory allocation? Mapping of buffer object failed while using cudaMallocHost

I have an application where each PE (CPU thread) allocates its own instance of a GPU context. When the application starts, each thread calls cudaMallocHost for the sizes in memPoolBoundaries[19] = [pow(2,8) : pow(2,26)] bytes, and each PE makes multiple buffer allocations for each of these sizes. If I run with a smaller number of PEs, say 8, this works, but increasing the PE count to 64 throws the error “mapping of buffer object failed”.

All PEs execute this in the init phase:

for (int i = 0; i < 19; i++) {
    int bufSize = CpvAccess(gpuManager).memPoolBoundaries[i]; // size of memory required (256, 512, 1024, ...)
    int numBuffers = nbuffers[i];                             // number of allocations of size bufSize
    pools[i].size = bufSize;
    pools[i].head = NULL;
    Header *hd = pools[i].head;
    Header *previous = NULL;
    for (int j = 0; j < numBuffers; j++) {
        cudaChk(cudaMallocHost((void **)&hd, sizeof(Header) + bufSize));
        // ... buffers are then linked into the pool ...
    }
}

It fails inside cudaMallocHost with the error “mapping of buffer object failed”.

I would really like to understand how to make this scale to hundreds or thousands of nodes on Cray machines.
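For a sense of scale, here is a back-of-envelope sketch (assuming, purely for illustration, one buffer per size class, i.e. nbuffers[i] == 1; the real totals are larger by those factors):

#include <stdio.h>

int main (void)
{
    /* Illustration only: one buffer per size class 2^8 .. 2^26;
       the actual footprint grows with the nbuffers[i] factors. */
    unsigned long long perPE = 0;
    for (int i = 8; i <= 26; i++)
        perPE += 1ULL << i;
    printf ("per PE: %llu bytes (~%.0f MiB)\n",
            perPE, perPE / (1024.0 * 1024.0));
    printf ("64 PEs: ~%.1f GiB of pinned memory\n",
            64.0 * perPE / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Even this floor comes to roughly 128 MiB of pinned host memory per PE, i.e. about 8 GiB across 64 PEs, so it is plausible that an OS or driver limit on pinned mappings is being hit.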

This is outside my area of expertise, but from what I understand, cudaMallocHost() maps in a relatively straightforward fashion to a relevant OS function, I think mmap on Linux. So the size of such allocations is limited by the OS; CUDA has no up-front knowledge of the available memory, it simply passes through the return status of the OS call, suitably translated.

If I remember correctly, cudaMallocHost requires tracking structures to be allocated in GPU memory so the GPU knows how to forward accesses to the CPU’s address space for that memory. So if you are right on the edge of running out of GPU memory, a big cudaMallocHost can push you over the edge.

I forget how much memory this takes, but you can figure it out by allocating a huge amount of memory with cudaMallocHost, and measuring the delta in device memory usage.
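Something along these lines should work for that measurement (a minimal sketch; the 1 GiB request size is just an example):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main (void)
{
    size_t freeBefore, freeAfter, total;
    void *p = NULL;
    size_t bytes = 1ULL << 30; /* 1 GiB pinned allocation; adjust as needed */

    /* the first runtime API call also creates the CUDA context */
    cudaMemGetInfo (&freeBefore, &total);
    if (cudaMallocHost (&p, bytes) != cudaSuccess) {
        printf ("pinned allocation of %zu bytes failed\n", bytes);
        return EXIT_FAILURE;
    }
    cudaMemGetInfo (&freeAfter, &total);
    printf ("device memory consumed by %zu-byte pinned allocation: %zu bytes\n",
            bytes, freeBefore - freeAfter);
    cudaFreeHost (p);
    return EXIT_SUCCESS;
}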

And as njuffa mentions, the OS can run out of pinned memory, and there are usually ways to increase the limit (e.g. ulimit).

Is there a limit to cudaMallocHost size?

If memsize > 1 GB in:

cudaHostAlloc((void **)&in_array, memsize, cudaHostAllocDefault)

I get “CUDA Runtime Error: out of memory”. However, I need a much larger cudaHostAlloc allocation, and the memory is available in the system.

Any ideas?

cudaHostAlloc() is a thin wrapper around operating system API calls. How much system memory can be allocated depends on the system specifications and the operating system. What are your system specifications, and which operating system are you using?

In most circumstances, allocation of a few GB should present no issues. For example, the following works just fine on a Windows system with 8 GB of physical system memory.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda_runtime.h>

int main (void)
{
    cudaError_t stat;
    size_t memsize = 2ULL * 1024 * 1024 * 1024; /* 2 GiB */
    uint8_t *in_array;
    stat = cudaHostAlloc((void **)&in_array, memsize, cudaHostAllocDefault);
    printf ("stat = %s\n", cudaGetErrorString (stat));
    if (stat == cudaSuccess) {
        printf ("successfully allocated %zu bytes\n", memsize);
        cudaFreeHost (in_array);
        return EXIT_SUCCESS;
    } else {
        printf ("allocation failed\n");
        return EXIT_FAILURE;
    }
}

This prints:

stat = no error
successfully allocated 2147483648 bytes

Allocation of 4 GB (that is, half the physical system memory) is likewise successful on this system but does seem to trigger some swapping to disk.
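If you want to find the practical ceiling on a particular system, one option is to simply probe with growing requests until cudaHostAlloc fails (a rough sketch; the 1 GiB step is arbitrary, and note that probing like this can drive the machine deep into swap):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main (void)
{
    const size_t step = 1ULL << 30; /* grow by 1 GiB per attempt */
    size_t lastOk = 0;
    for (size_t sz = step; ; sz += step) {
        void *p = NULL;
        cudaError_t stat = cudaHostAlloc (&p, sz, cudaHostAllocDefault);
        if (stat != cudaSuccess) {
            printf ("failed at %zu bytes: %s\n", sz, cudaGetErrorString (stat));
            break;
        }
        lastOk = sz;
        cudaFreeHost (p);
    }
    printf ("largest successful pinned allocation: %zu bytes\n", lastOk);
    return EXIT_SUCCESS;
}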
