Hello,
when I saw the tutorials about CUDA 4.0/4.1 I was very pleased to see that there is cudaHostRegister(), and I assumed it would suit my situation very well.
I have 188GB RAM above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time. A data acquisition kernel module accumulates data into this large area, which is used as a ring buffer. A user space application mmaps the whole area into user space at startup, then transfers blocks of data to the GPU at 10Hz for processing. Since I know the memory is contiguous, I assumed that cudaHostRegister() would tell CUDA to use DMA when transferring the data blocks. But alas, the outcome is not quite what I expected:
Data block size for cudaMemcpyHostToDevice: 16MB
Data transfer times:
cudaHostMalloc’ed data block: 2.888ms (5540MB/s)
simple malloc’ed data block: 6.194ms (2583MB/s)
contiguous memory at address 0x100000000 (at 4GB) mmapped into user space: 175.947ms (91MB/s)
contiguous memory at address 0x100000000 (at 4GB) mmapped into user space, combined with memcpy into cudaHostMalloc’ed staging buffer: 307.952ms (52MB/s)
contiguous memory at address 0x100000000 (at 4GB) mmapped into user space, with cudaHostRegister applied only to the 16MB block: Kernel execution failed : (30) unknown error.
contiguous memory at address 0x100000000 (at 4GB) mmapped into user space, with cudaHostRegister applied to all 188GB: Kernel execution failed : (2) out of memory.
I would have expected the transfer from the registered memory to take about 2.888ms, the same as for the cudaHostMalloc’ed block.
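As a quick cross-check of the figures above, here is a small Python sketch that recomputes the transfer rates from the block size and the measured times, and compares each time against the 100ms period implied by the 10Hz rate. The function and variable names are made up for illustration and are not part of the application. It assumes that each 10Hz transfer moves exactly one 16MB block, which works out to a sustained 160MB/s.

```python
BLOCK_MB = 16.0    # block size used for the measurements above
PERIOD_MS = 100.0  # assumption: one 16MB block per 10Hz cycle

# Measured cudaMemcpyHostToDevice times in milliseconds, copied from the list above.
measured_ms = {
    "cudaHostMalloc'ed block": 2.888,
    "malloc'ed block": 6.194,
    "mmapped memory at 4GB": 175.947,
    "mmapped memory + memcpy into staging buffer": 307.952,
}


def throughput_mb_s(block_mb, time_ms):
    """Return the transfer rate in MB/s for one block moved in time_ms."""
    return block_mb / (time_ms / 1000.0)


def fits_period(time_ms, period_ms=PERIOD_MS):
    """Return True if one transfer completes within one acquisition period."""
    return time_ms < period_ms


if __name__ == "__main__":
    for label, ms in measured_ms.items():
        rate = throughput_mb_s(BLOCK_MB, ms)
        verdict = "within 100ms" if fits_period(ms) else "exceeds 100ms"
        print(f"{label}: {rate:.0f}MB/s, {verdict}")
```

Under that assumption, only the first two cases complete within one 100ms cycle. Both transfers from the mmapped region take longer than a cycle, which is why the direct path matters here.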
Is there something I am doing wrong, or am I making a false assumption? Does cudaHostRegister only work for 32-bit addresses?
Any hints much appreciated.
peter
For the record:
Operating system: Linux 2.6.32-40-generic #87-Ubuntu SMP Tue Mar 6 00:56:56 UTC 2012 x86_64 GNU/Linux
NVIDIA driver: NVIDIA-Linux-x86_64-285.05.33
toolkit: cudatoolkit_4.1.28_linux_64_ubuntu10.04
SDK: gpucomputingsdk_4.1.28_linux