cudaHostRegister crash or poor performance, unknown error (30) in kernel, on a 64-bit host operating system

Hello,

When I saw the tutorials about CUDA 4.0/4.1, I was very pleased to see that there is cudaHostRegister(), assuming it would suit my situation very well.

I have 188GB of RAM above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time. A data acquisition kernel module accumulates data into this large area, which is used as a ring buffer. A user space application mmap's this area into user space at startup, then transfers blocks of data at 10Hz to the GPU for processing. Since I know the memory is contiguous, I assumed that cudaHostRegister() would tell CUDA to use DMA when transferring the data blocks. But alas, the outcome is not quite what I had expected:
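In outline, what I am attempting looks like the sketch below (resMemBase, blockOffset and devBuffer stand in for my actual pointers; this is not verbatim code from my application):

// Sketch only: pin a 16MB block inside the mmap'ed reserved region, then copy it.
size_t blockSize = (size_t)16 * 1024 * 1024;
char*  hostBlock = (char*)resMemBase + blockOffset;   // placeholder pointer into the mmap'ed area

// Expectation: pinning lets the driver DMA straight out of these pages.
cudaError_t err = cudaHostRegister(hostBlock, blockSize, cudaHostRegisterDefault);
if (err != cudaSuccess)
    fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));

cudaMemcpy(devBuffer, hostBlock, blockSize, cudaMemcpyHostToDevice);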

Data block size for cudaMemcpyHostToDevice: 16MB
Data transfer times:

cudaMallocHost'ed data block: 2.888ms (5540MB/s)
simple malloc'ed data block: 6.194ms (2583MB/s)
memory at hardware address 0x100000000 (4GB) mmap'ed into user space: 175.947ms (91MB/s)
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, combined with a memcpy into a cudaMallocHost'ed staging buffer: 307.952ms (52MB/s)
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, with cudaHostRegister applied only to the 16MB block: Kernel execution failed: (30) unknown error.
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, with cudaHostRegister applied to all 188GB: Kernel execution failed: (2) out of memory.

I would have expected the cudaHostRegister cases to transfer in 2.888ms as well.

Is there something I am doing wrong, or is one of my assumptions false? Does cudaHostRegister only work for 32-bit addresses?

Any hints much appreciated.
peter

For the record:

Operating system: Linux 2.6.32-40-generic #87-Ubuntu SMP Tue Mar 6 00:56:56 UTC 2012 x86_64 GNU/Linux
NVidia driver: NVIDIA-Linux-x86_64-285.05.33
toolkit: cudatoolkit_4.1.28_linux_64_ubuntu10.04
SDK: gpucomputingsdk_4.1.28_linux

I have never looked at the glue code that comes with Nvidia’s Linux drivers, but I would assume it needs some adjustments in order to be able to pin mmapped memory.

Can you provide a repro?

By further Googling and searching through this forum, I have found a related post by tmurray who said that cudaHostRegister cannot be used with mmap'ed memory. This somehow makes sense if cudaHostRegister is designed to work on malloc'ed memory and flags the memory pages as locked for the kernel. mmap does not do any mallocs but has its own mechanisms (such as MAP_LOCKED).
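To illustrate what I mean by mmap's own mechanisms: locking pages via mmap would look roughly like this sketch (fd and length are placeholders), which is a separate mechanism from whatever cudaHostRegister does internally:

// MAP_LOCKED asks the kernel to lock the mapped pages, much like mlock() would.
void* p = mmap(NULL, length, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_LOCKED, fd, 0);
if (p == MAP_FAILED)
    perror("mmap");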

My case is indeed a bit different: I know that my reserved memory from 4GB to 192GB is contiguous, and I know that our FPGA card can DMA into this very memory area. I assume that the M2090 can also DMA to and from this area, i.e. its DMA engine can handle 64-bit addresses. Now, how do I tell CUDA to do just that, i.e. to use DMA to transfer from my reserved memory up to the M2090?

I played around with a cudaMallocHost()'ed staging buffer and copied data from my reserved memory region into that staging buffer. It turned out that memory reads from my mmap'ed reserved physical memory are very slow (50MB/s). I do not understand yet why this is the case, but it is rather a Linux issue than a GPU issue. If this memcpy were fast, then I could probably live with the staging buffer approach.
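The staging buffer experiment boils down to the following sketch (BLOCKSIZE, mmapedSource and devBuffer are placeholders):

char* staging;
cudaMallocHost(&staging, BLOCKSIZE);            // pinned staging buffer

memcpy(staging, mmapedSource, BLOCKSIZE);       // the slow part: ~50MB/s out of the reserved region
cudaMemcpy(devBuffer, staging, BLOCKSIZE,
           cudaMemcpyHostToDevice);             // the fast part: ~5.5GB/s once the data is pinned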

Kind regards,

peter

It works fine with mmap'd memory; that's basically how cudaMallocHost is implemented as of 4.1/r285.

Well, I presume the NVidia kernel module does a kmalloc which is then mmap'ed by the CUDA library. So there must be something in my code that does not do what cudaHostRegister is expecting.

As mentioned earlier, I reserve memory from 4GB to 192GB at boot time using the Linux mem=4G and memmap=188G$4G command-line parameters. In a dedicated character driver for this reserved memory I implemented the mmap file operation, which allows a user space program to map the physical memory into user space virtual memory (the flags I am using for mmap are MAP_FILE | MAP_SHARED).

Using this memory area directly to copy data to the GPU works, but with poor performance. Applying cudaHostRegister causes an unknown error for small memory blocks (16MB), and an out of memory error for all 188GB.

I guess the difference between cudaMallocHost and my approach is that the memory reserved at boot time is not managed by Linux as such, even though Linux's mmap will produce a virtual memory mapping for the physical memory area. Perhaps this is the reason for the poor performance. Having said that, even NVidia's mmap file operation will have to use something like

remap_pfn_range(vma, vma->vm_start,
                virt_to_phys((void *)kmalloc_area) >> PAGE_SHIFT,
                size, vma->vm_page_prot);

which in my case is:

remap_pfn_range(vma, vma->vm_start,
                resmem_hwaddr >> PAGE_SHIFT,
                resmem_length, vma->vm_page_prot);

Bottom line: cudaHostRegister does not work for me with CUDA 4.1 on Linux 2.6.32-40-generic (Ubuntu 10.04.4 LTS) right now, but I remain confident that it can be made to work. The question is only how.

Regards,

peter

As of 4.1/r285, we mmap in the user-mode process and do the same internal operations as cudaHostRegister to implement cudaMallocHost.

At this point, your best bet is to file a bug as a registered developer with as much information as you can provide. I'm guessing we don't support MAP_FILE with cudaHostRegister (all of our mmap calls use MAP_ANONYMOUS), but that's just a guess from someone who doesn't know that part of the codebase particularly well.

Thanks for the reply. I’ll try MAP_ANONYMOUS next instead of MAP_FILE. Perhaps this is the key.
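For reference, the flag difference we are talking about is just this (a sketch with placeholder fd and length):

// What I currently do: a shared, file-backed mapping of my character device.
void* p1 = mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_FILE | MAP_SHARED, fd, 0);

// An anonymous mapping, as the CUDA runtime reportedly uses, is not backed by any file.
void* p2 = mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);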

I also tried to copy from malloc'ed memory (through C++ new) into my mmap'ed reserved physical memory and found very poor performance, about 50MB/s. As a matter of fact, the performance of memcpy(mmapedDest, mallocSource, 16*1024*1024UL) was pretty much the same as doing a *dest++ = *source++ loop (using unsigned long and the appropriate number of iterations). This suggests that memcpy does not do any page-level copying but copies word by word, which might explain why the NVidia driver cannot use DMA either.
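Concretely, the two copies I compared were essentially these (a sketch; pointer names as above):

const size_t N = (size_t)16 * 1024 * 1024;

memcpy(mmapedDest, mallocSource, N);                      // library copy: ~50MB/s

unsigned long*       d = (unsigned long*)mmapedDest;      // hand-rolled 64-bit copy loop:
const unsigned long* s = (const unsigned long*)mallocSource;
for (size_t i = 0; i < N / sizeof(unsigned long); ++i)    // also ~50MB/s
    *d++ = *s++;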

My gut feeling is that the key lies in the appropriate flags for mmap in user space. Perhaps Nvidia could give some recommendations on how to use mmap in order to make the GPU use DMA. (Even though it is a mystery to me how the driver could possibly tell from a 64-bit quantity, the address, whether it can use it for DMA or not.)

Kind regards,

peter

Doesn’t that make it obvious that the problem lies somewhere other than CUDA? What OS are you using and what’s the (file) cache setup? Are you sure there is no disk thrashing during this copy? What if you copy into the same region multiple times? Is it that slow every time?

Indeed it does. As mentioned in my first post, I am using Ubuntu 10.04.4, Linux 2.6.32-40-generic, but I do not know what the (file) cache setup is. I doubt that there is any relation to the hard disk. I have run the copies in a loop, and it is always the same, always equally slow, as if memcpy copies 64-bit quantities in a loop (I know this because I have tried casting the pointer to different types, from unsigned char to unsigned long, and the time memcpy takes is the same as a loop copy of unsigned longs).

Well, the bottom line is that somehow I do not manage to convey the contiguous nature of my memory area to CUDA or to the Linux kernel. In addition, the poor performance could be the result of the Linux kernel not managing the reserved memory, or of reads/writes from/to that memory not going through the cache, or other things, or a combination. One thing is sure: 50MB/s is slow even without a cache.

Kind regards,

peter

Hello,

Today I have carried out all sorts of tests, the results of which I would like to share on this forum. I also attach the source code used, which should build easily on a Linux computer.

It appears that the performance depends on the size of the memory reserved at boot time and later mmap'ed into user space. I have carried out memcpy tests of a 16MB data block in three scenarios:

A: memcpy from mmap’ed to malloc’ed (through new operator)

B: cudaMemcpy from mmap'ed to cudaMalloc'ed (device memory), which is what I am mostly concerned about (see the timing sketch after this list)

C: cudaMemcpy from cudaMallocHost’ed to cudaMalloc’ed (device memory)
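For reference, scenario B boils down to a timed cudaMemcpy along these lines (a sketch with placeholder names devBuffer, mmapedSource and BLOCKSIZE; the attached source is the authoritative version):

cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);

cudaEventRecord(t0, 0);
cudaMemcpy(devBuffer, mmapedSource, BLOCKSIZE, cudaMemcpyHostToDevice);   // B: mmap'ed -> device
cudaEventRecord(t1, 0);
cudaEventSynchronize(t1);

float ms = 0.0f;
cudaEventElapsedTime(&ms, t0, t1);
printf("B: %.3fms (%.2fMB/s)\n", ms, (BLOCKSIZE / (1024.0 * 1024.0)) / (ms / 1000.0));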

The resulting transfer times (and speeds in MB/s) for the different sizes of reserved RAM being mmap'ed are in the following table (two runs):

Reserved RAM: |          1GB          |           4GB          |          16GB          |          64GB          |          128GB          |          188GB

Run 1:
A:            | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) |  9.326ms (1798.97MB/s) | 213.892ms (  78.43MB/s) | 206.476ms (  81.25MB/s)
B:            | 4.494ms (3733.25MB/s) |  4.508ms (3721.65MB/s) |  4.706ms (3565.07MB/s) |  4.513ms (3717.53MB/s) | 111.657ms ( 150.25MB/s) | 112.231ms ( 149.48MB/s)
C:            | 2.933ms (5720.16MB/s) |  2.949ms (5689.12MB/s) |  2.946ms (5694.91MB/s) |  2.951ms (5685.27MB/s) |   3.001ms (5590.54MB/s) |   3.022ms (5551.69MB/s)

Run 2:
A:            | 4.255ms (3942.94MB/s) |  4.249ms (3948.51MB/s) |  4.257ms (3941.09MB/s) |  4.298ms (3903.49MB/s) | 208.269ms (  80.55MB/s) | 200.627ms (  83.62MB/s)
B:            | 4.37ms  (3839.18MB/s) |  4.411ms (3803.5MB/s)  |  4.456ms (3765.08MB/s) |  4.391ms (3820.82MB/s) | 111.562ms ( 150.38MB/s) | 112.09ms  ( 149.67MB/s)
C:            | 2.937ms (5712.37MB/s) |  2.949ms (5689.12MB/s) |  2.949ms (5689.12MB/s) |  2.954ms (5679.49MB/s) |   2.976ms (5637.51MB/s) |   2.985ms (5620.51MB/s)
My observations:

  1. cudaMemcpy from cudaMallocHost'ed to cudaMalloc'ed memory (case C) is of course unaffected by the mmap. The transfer is carried out in under 3ms, or about 5.5GB/s, which is quite good.

  2. From the first to the second run, the memcpy from mmap'ed to malloc'ed memory seems to benefit from the contents already being cached somewhere.

  3. There is a significant performance degradation above 64GB, noticeable both with memcpy and with cudaMemcpy. The latter still achieves about twice the throughput of memcpy.

  4. For reserved memory <= 64GB, cudaMemcpy from mmap'ed to cudaMalloc'ed memory takes about 50% longer, which is unexpected because the memory is known to be contiguous and the Nvidia device's DMA engine should be able to exploit that.

I would love to read an explanation for this behavior. Perhaps somebody in the Linux kernel developer community thought: 64GB should be enough for anybody (does this ring a bell?).

Kind regards,

peter

Peter, you might get a quicker response if you make your example tiny (I know it's small, but make it tiny and remove ALL CUDA related items) and then ask on the Linux Kernel Mailing List. Be sure to tell them your kernel version and hardware setup (motherboard/CPU/RAM configuration).

Hello RezaRob, yes, that is what I did last night on Stack Overflow. The first response mentioned cache size, even though that does not sound very plausible to me, as a DMA transfer controlled by the Nvidia driver should be unaffected by the cache. In addition, a memory transfer exploiting the cache would depend on the size of the memory transferred, not on the size of the memory reserved.

What is still a puzzle to me is that for reserved RAM < 64GB, where the performance is not too bad, a cudaMemcpy from mmap'ed to cudaMalloc'ed (device memory) takes 50% longer than a cudaMemcpy from cudaMallocHost'ed to cudaMalloc'ed (device memory). My only explanation would be that the Nvidia driver, since it does not know about the contiguous nature of the mmap'ed memory, has to go through all pages and either work out a DMA transfer schedule or do DMA on a page-by-page basis. In either case, would there be a way to tell the Nvidia driver that it should not worry and just assume that the mmap'ed area is indeed contiguous?

Kind regards,
peter

How do you know that it’s contiguous?

Well, I reserve memory at boot time on the boot command line with memmap=64G$128G; my assumption is that the memory area set aside by the Linux kernel refers to a contiguous block of 64GB of physical memory at offset 128GB. Of course, mmap presents this memory area in virtual address space to user space, and perhaps there the virtual memory pages are not contiguous any more. But I guess this does not matter, as the Nvidia driver will somehow get the physical addresses back and realize that the underlying physical memory pages are indeed contiguous. Maybe it is this process that takes time, and if the Nvidia driver allocates the contiguous memory itself, it does not have to do that at a later stage when it comes to the DMA transfer.

For now I will have to live with the 50% performance hit compared to cudaMallocHost'ed memory, and with the fact that I can only use 64GB of RAM.

Regards,

peter

Hi Peter,

I’m really surprised that you keep bringing up the Nvidia driver! Are you sure this isn’t because of your hardware setup somehow? Or related to how Linux is mapping this space? Your own tests clearly demonstrate that the fastest possible “mmap-to-RAM” transfer you can obtain is exactly equivalent to how fast the Nvidia driver is pushing this data into the PCI pipe. How can it do any better than the absolute maximum which you can demonstrate your hardware is capable of?

I am not disputing or blaming the NVidia driver; the performance with cudaMallocHost’ed memory is consistent and very close to the hardware limit (well done Nvidia). Let me re-phrase what I tried to explain:

I reserve physical memory at boot time using the memmap=64G$128G Linux command-line argument; by its nature this memory region has to be contiguous. I have written a small kernel driver that implements an mmap file operation:

module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);

// ...

static int resmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
    // Map the reserved physical region straight into the caller's VMA.
    if (remap_pfn_range(vma, vma->vm_start,
                        resmem_hwaddr >> PAGE_SHIFT,
                        resmem_length, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}

A user space program opens the respective character device and calls mmap, which maps the aforementioned physical memory into the program's virtual memory. When the user space program then calls cudaMemcpy() to copy a block of data from this mmap'ed physical memory into device memory, I do not achieve the same performance as I do with cudaHostAlloc'ed memory:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define BLOCKSIZE ((size_t)16*1024*1024)

int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);

unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);

// Map the reserved physical memory into user space (RESMEM_DEV and
// RESMEM_IOC_LENGTH come from my character driver's header); note the 4096 offset.
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE,
                          MAP_SHARED, resMemFd, 4096);
char* source = (char*)resMemBase;

// Pinned staging buffer allocated by the CUDA runtime.
char* destination;
cudaMallocHost(&destination, BLOCKSIZE);

// Time a plain memcpy out of the mmap'ed region.
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
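For completeness, the device-memory variant of the same measurement looks roughly like this (a sketch; devBuffer is a placeholder, and the elapsed time is computed from the two timevals above):

char* devBuffer;
cudaMalloc(&devBuffer, BLOCKSIZE);

gettimeofday(&start, NULL);
cudaMemcpy(devBuffer, source, BLOCKSIZE, cudaMemcpyHostToDevice);   // mmap'ed -> device
gettimeofday(&end, NULL);

double ms = (end.tv_sec - start.tv_sec) * 1000.0
          + (end.tv_usec - start.tv_usec) / 1000.0;
printf("%.3fms (%.2fMB/s)\n", ms, (BLOCKSIZE / (1024.0 * 1024.0)) / (ms / 1000.0));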

My explanation for this difference is that the Nvidia driver inherently knows about the contiguous nature of a memory block if that block was allocated with cudaMallocHost. Hence it can exploit this knowledge to instruct the device's DMA controller to carry out a DMA transfer for the entire block. If, however, cudaMemcpy is used on a block that was not allocated by the NVidia driver, the latter cannot know about the contiguous nature of the block. Presumably, before being able to transfer the data to the device, the driver has to sweep through the memory block first, probably page-wise, determine the underlying physical addresses, and work out a DMA schedule. Alternatively it would have to use an internal DMA-able staging buffer.

So what I am after is some call into the driver that allows me to pass on my a priori knowledge, so that the driver can copy the data as efficiently as from any cudaMallocHost'ed area. My original assumption was that cudaHostRegister does exactly that. Unfortunately it has no effect, or crashes, when applied to the address returned by my mmap.

Kind regards,

peter

I posted a potentially related topic on the new forums at:

http://forums.developer.nvidia.com/devforum/discussion/7386/portable-pinned-memory-and-multiple-gpus-performance-and-stability

(BTW, which forums are we supposed to use?)

As covered in that post, I get driver/kernel crashes when accessing portable pinned memory from multiple CUDA contexts. I also get very slow transfers for the context in which the memory was not allocated (i.e., 20-200x slower transfers than for the allocating context). There is some code and a kernel log included in the other post. I can port the post over here as a new topic if this is the more appropriate forum.

Regards,
Tom

Hello Tom, there are several things going on in your test, and I wonder whether it would be better to isolate matters first:

  1. Concurrency. Several threads are concurrently copying from a memory area, whether it was cudaMallocHost'ed or malloc'ed, and are hence competing for it. How does your program behave if you keep it simpler and only investigate concurrent memory access? Perhaps in theory concurrent reads from the same memory source do not affect performance, but maybe there is a penalty in how the MMU (and the cache) manage reads from the same location in different thread contexts. Perhaps cudaMemcpy is not thread safe? It would entail two DMA engines from two different devices carrying out data transfers concurrently, and the Nvidia driver would have to set that up, too. Perhaps it needs a mutex?

  2. A performance comparison between cudaMemcpy from cudaMallocHost'ed and from malloc'ed areas could be done as a separate test in a single thread. It should make a difference, as with cudaMallocHost you should be able to obtain roughly 5.6GB/s (see the sketch below).
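The single-threaded comparison in point 2 could be as simple as this sketch (buffer names and BLOCKSIZE are placeholders):

char *pinnedSrc, *pageableSrc, *devDst;
cudaHostAlloc(&pinnedSrc, BLOCKSIZE, cudaHostAllocPortable);   // portable pinned, usable from all contexts
pageableSrc = (char*)malloc(BLOCKSIZE);
cudaMalloc(&devDst, BLOCKSIZE);

cudaMemcpy(devDst, pinnedSrc,   BLOCKSIZE, cudaMemcpyHostToDevice);   // expect roughly 5.6GB/s
cudaMemcpy(devDst, pageableSrc, BLOCKSIZE, cudaMemcpyHostToDevice);   // expect noticeably less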

Regards,

peter

Hi Peter, thanks for the response. My responses are inlined below.

I have also tested guarding the cudaMemcpy() calls with a mutex, but I got the same results (poor performance for the out-of-context copy).

Yes, with a single thread I get better performance with pinned transfers. However, for my use case I need to transfer the same data to multiple devices. Peer-to-peer transfers may be an option, but that raises other issues.

Regards,

Tom