I have an application where each PE (CPU thread) allocates its own GPU context. At initialization, each thread calls cudaMallocHost for sizes memPoolBoundaries[19] = [pow(2,8) : pow(2,26)] bytes. Each PE makes multiple buffer allocations for these sizes. If I run with a smaller number of PEs, say 8, it works, but on increasing the PEs to 64 it throws the error "mapping of buffer object failed".
All PEs execute this in the init phase:
for (int i = 0; i < 19; i++) {
    int bufSize = CpvAccess(gpuManager).memPoolBoundaries[i]; // size of memory required (256, 512, 1024, ...)
    int numBuffers = nbuffers[i]; // number of allocations for size bufSize
    pools[i].size = bufSize;
    pools[i].head = NULL;
    Header *previous = NULL;
    for (int j = 0; j < numBuffers; j++) {
        Header *hd = NULL;
        cudaChk(cudaMallocHost((void **)&hd, sizeof(Header) + bufSize));
        // link the new buffer into the pool's list
        // (assumes Header carries a next pointer)
        if (previous == NULL)
            pools[i].head = hd;
        else
            previous->next = hd;
        previous = hd;
    }
}
It fails inside cudaMallocHost with the error "mapping of buffer object failed".
This is outside my area of expertise, but from what I understand, cudaMallocHost() maps in a relatively straightforward fashion to the relevant OS function, I believe mmap on Linux. So the size of such allocations is limited by the OS, and CUDA has no up-front knowledge of the available memory; it simply passes through the return status of the OS call, suitably translated.
If I remember correctly, cudaMallocHost requires tracking structures to be allocated in GPU memory so the GPU knows how to forward accesses to the CPU’s address space for that memory. So if you are right on the edge of running out of GPU memory, a big cudaMallocHost can push you over the edge.
I forget how much memory this takes, but you can figure it out by allocating a huge amount of memory with cudaMallocHost, and measuring the delta in device memory usage.
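That measurement can be sketched with standard CUDA runtime calls, cudaMemGetInfo before and after a large pinned allocation (error checking trimmed for brevity):

```cpp
// Record free device memory, make a large pinned allocation, record again.
// The difference is the device-side cost of the pinned mapping.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBefore, freeAfter, total;
    cudaMemGetInfo(&freeBefore, &total);

    void *pinned = nullptr;
    cudaMallocHost(&pinned, 1ull << 30);   // 1 GiB of pinned host memory

    cudaMemGetInfo(&freeAfter, &total);
    printf("device memory consumed by pinned mapping: %zu bytes\n",
           freeBefore - freeAfter);

    cudaFreeHost(pinned);
    return 0;
}
```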
And as njuffa mentions, the OS can run out of pinned memory, and there are usually ways to increase the limit (e.g. ulimit).
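On Linux the relevant limit is the locked-memory resource limit; a quick way to inspect and raise it (the limits.conf lines assume a typical PAM setup):

```shell
# Show the current locked-memory limit for this shell, in KiB
# (or "unlimited"); pinned allocations count against it.
ulimit -l

# Raise it for the current shell (needs a sufficiently high hard
# limit, or root):
#   ulimit -l unlimited

# To raise it persistently, add lines like these to
# /etc/security/limits.conf:
#   *  soft  memlock  unlimited
#   *  hard  memlock  unlimited
```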
If memsize > 1GB in:
cudaHostAlloc((void **)&in_array, memsize, cudaHostAllocDefault)
I get "CUDA Runtime Error: out of memory". However, I need a much larger cudaHostAlloc allocation, and the memory is available in the system.
cudaHostAlloc() is a thin wrapper around operating system API calls. How much system memory can be allocated depends on the system specifications and the operating system. What are your system specifications, what is the operating system?
In most circumstances, allocation of a few GB should present no issues. For example, the following works just fine on a Windows system with 8 GB of physical system memory.
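A minimal test along those lines (my sketch, not the original poster's exact code): allocate well over 1 GB of pinned memory with cudaHostAlloc and touch every page.

```cpp
// Allocate 4 GiB of pinned host memory and fault the pages in.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t memsize = 4ull << 30;  // 4 GiB, well above the 1 GB mark
    void *in_array = nullptr;
    cudaError_t err = cudaHostAlloc(&in_array, memsize, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    memset(in_array, 0, memsize);  // touch the pages
    printf("allocated and touched %zu bytes of pinned memory\n", memsize);
    cudaFreeHost(in_array);
    return 0;
}
```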