Arbitrary Device Limit On Pinned Host Memory

This may be a lack of understanding on my part, but there appears to be an arbitrary limit on the amount of host memory that can be mapped and pinned/locked by CUDA, a limit that has absolutely NOTHING to do with the host.

I have a system with 200GB+ host RAM and a K40C with 12GB RAM. I can mmap and mlock 16GB+ of RAM on the host with no problem. But when I attempt to register anything more than about 8GB total with cudaHostRegister, either the entire system hangs or I get an out-of-memory error from CUDA.

Anyone got any clues as to what the problem might be?
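For reference, the pattern I'm describing is roughly this (a minimal sketch with an illustrative size; not my actual code):

```cuda
// Sketch: map + lock host memory, then ask CUDA to pin it for DMA.
// The size here is illustrative only.
#include <stdio.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

int main() {
  size_t len = 8ULL * 1024 * 1024 * 1024;  // ~8GB works; larger fails/hangs

  // Anonymous mapping, locked into physical RAM by the host.
  void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
  if (buf == MAP_FAILED) { perror("mmap"); return 1; }

  // Page-lock (pin) the same region for the device.
  cudaError_t err = cudaHostRegister(buf, len, cudaHostRegisterMapped);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("registered %zu bytes\n", len);

  cudaHostUnregister(buf);
  munmap(buf, len);
  return 0;
}
```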

What operating system are you using?
Is there any difference in behavior if you use cudaHostAlloc instead of cudaHostRegister?

So cudaHostAlloc works up to 12GB; between 12GB and 14GB the system hangs; and at 14GB and larger it fails with error 2:

/**
* The API call failed because it was unable to allocate enough memory to
* perform the requested operation.
*/
cudaErrorMemoryAllocation = 2,

I’m running Fedora 20.

So here is the thing that I don’t get. On every Linux distro I’ve checked, the soft limit on the amount of memory that a standard user process can lock (RLIMIT_MEMLOCK) is 64KB, while the hard limit is always set to unlimited. If I raise the soft limit to unlimited as a privileged user, I can then lock as much memory as I like, until I attempt to mlock (or mmap with MAP_LOCKED) more memory than the system can spare. At that point the process is killed immediately, as opposed to the hang that often happens with cudaHostAlloc or cudaHostRegister. So the hang that can happen with cudaHostAlloc or cudaHostRegister is not the same behaviour as I see from the host.

Since I can get at least 8GB of pinned memory with either cudaHostRegister or cudaHostAlloc, the NVIDIA driver/CUDA must in some way be privileged enough to lock more than the 64KB limit.

Taken together, these two observations suggest to me that it is the NVIDIA driver/CUDA that is enforcing the 12GB limit.

On RHEL 6.2, using CUDA 5.5, on a server with 48GB of main memory, I had no trouble pinning 16GB of memory:

$ cat t525.cu
#include <stdio.h>
#define DSIZE (1048576ULL * 1024ULL * 16ULL)

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(){

  int *p;
  cudaHostAlloc(&p, DSIZE, cudaHostAllocMapped);
  cudaCheckErrors("Host Alloc fail");
  printf("finished\n");
  return 0;
}
$ nvcc -arch=sm_20 -o t525 t525.cu
$ time ./t525
finished

real    0m14.987s
user    0m0.010s
sys     0m14.747s
$

I don’t have the slightest idea why you can and I can’t. I’m using CUDA 6.0 and compiling for 3.x architectures, so unless there was a regression somewhere, I can only presume it’s something stupid I’m doing. But I can’t for the life of me figure out what.

You’re sure the allocation is actually being performed? Since you’re not doing anything with the allocated memory, I’m wondering whether the compiler might be optimizing everything away. Just a question.

A rudimentary check seemed to work:

$ cat t525.cu
#include <stdio.h>
#define DSIZE (1048576ULL * 1024ULL * 16ULL)
#define CHECK_VAL 50
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void check_kernel(int *data){
  if (data[0] != CHECK_VAL) {printf("location 0 check failed\n"); return;}
  if (data[(DSIZE/(sizeof(int)))/2] != CHECK_VAL+1) {printf("midpoint check failed\n"); return;}
  if (data[(DSIZE/(sizeof(int)))-1] != CHECK_VAL+2) {printf("final location check failed\n"); return;}
  printf("kernel checks passed\n");
}

int main(){

  int *p;
  cudaHostAlloc(&p, DSIZE, cudaHostAllocMapped);
  cudaCheckErrors("Host Alloc fail");
  p[0] = CHECK_VAL;
  p[(DSIZE/(sizeof(int)))/2] = CHECK_VAL+1;
  p[(DSIZE/(sizeof(int)))-1] = CHECK_VAL+2;
  check_kernel<<<1,1>>>(p);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  printf("finished\n");
  return 0;
}
$ nvcc -arch=sm_20 -o t525 t525.cu
$ time ./t525
kernel checks passed
finished

real    0m14.991s
user    0m0.009s
sys     0m14.747s
$

Just for grins I tried 32GB (changing the 16ULL to 32ULL) and that passed too.

Thanks txbob. Appreciate your efforts.

I guess the only conclusion is that something is messed up with my config/compile/server/something, because your code didn’t work for me either.