cudaHostRegister crash or poor performance: unknown error (30) in kernel for 64-bit host operating system
Hello,

When I saw the tutorials about 4.0/4.1, I was very pleased to see that there is now cudaHostRegister(), assuming it would suit my situation very well.

I have 188GB of RAM above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time. A data acquisition kernel module accumulates data into this large area, which is used as a ring buffer. A user space application mmaps this area into user space when it starts, then transfers blocks of data at 10Hz to the GPU for processing. Since I know the memory is contiguous, I assumed that cudaHostRegister() would tell CUDA to use DMA when transferring the data blocks. But alas, the outcome is not quite what I expected:

Data block size for cudaMemcpyHostToDevice: 16MB
Data transfer times:

cudaMallocHost'ed data block: 2.888ms (5540MB/s)
simple malloc'ed data block: 6.194ms (2583MB/s)
memory at hardware address 0x100000000 (4GB) mmap'ed into user space: 175.947ms (91MB/s)
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, combined with a memcpy into a cudaMallocHost'ed staging buffer: 307.952ms (52MB/s)
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, with cudaHostRegister applied only to the 16MB block: Kernel execution failed: (30) unknown error.
contiguous memory at hardware address 0x100000000 (4GB) mmap'ed into user space, with cudaHostRegister applied to all 188GB: Kernel execution failed: (2) out of memory.

I would have expected the cudaHostRegister'ed cases to also take 2.888ms.

Is there something I am doing wrong, or is one of my assumptions false? Does cudaHostRegister only work for 32-bit addresses?
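For reference, the failing path boils down to roughly the following (a minimal sketch; the device node /dev/resmem, the error-check macro and the flag choice are placeholders for what my actual code does):

[code]
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) \
    fprintf(stderr, "%s failed: (%d) %s\n", #call, (int)e, cudaGetErrorString(e)); } while (0)

int main(void)
{
    const size_t blockSize = 16UL << 20;   /* 16MB block transferred at 10Hz */
    const size_t resmemLen = 188UL << 30;  /* 188GB reserved above 4GB */

    /* map the reserved physical memory exposed by the acquisition driver */
    int fd = open("/dev/resmem", O_RDWR);  /* placeholder device node */
    void *ring = mmap(NULL, resmemLen, PROT_READ | PROT_WRITE,
                      MAP_FILE | MAP_SHARED, fd, 0);

    void *dev;
    CHECK(cudaMalloc(&dev, blockSize));

    /* register only the 16MB block that is about to be transferred; in my
       tests the subsequent processing step then reports "(30) unknown error" */
    CHECK(cudaHostRegister(ring, blockSize, cudaHostRegisterPortable));
    CHECK(cudaMemcpy(dev, ring, blockSize, cudaMemcpyHostToDevice));
    /* ... launch processing kernel here; this is where (30) shows up ... */
    CHECK(cudaHostUnregister(ring));

    cudaFree(dev);
    return 0;
}
[/code]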

Any hints much appreciated.
peter
--

For the record:

Operating system: Linux 2.6.32-40-generic #87-Ubuntu SMP Tue Mar 6 00:56:56 UTC 2012 x86_64 GNU/Linux
NVidia driver: NVIDIA-Linux-x86_64-285.05.33
toolkit: cudatoolkit_4.1.28_linux_64_ubuntu10.04
SDK: gpucomputingsdk_4.1.28_linux

#1
Posted 04/18/2012 10:16 AM   
I have never looked at the glue code that comes with Nvidia's Linux drivers, but I would assume it needs some adjustments in order to be able to pin mmapped memory.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#2
Posted 04/18/2012 02:16 PM   
Can you provide a repro?

#3
Posted 04/18/2012 04:34 PM   
[quote name='tera' date='18 April 2012 - 03:16 PM' timestamp='1334758560' post='1397857']
I have never looked at the glue code that comes with Nvidia's Linux drivers, but I would assume it needs some adjustments in order to be able to pin mmapped memory.
[/quote]

By further Googling and searching through this forum I have found a related post by tmurray saying that cudaHostRegister cannot be used with mmap'ed memory. This makes some sense if cudaHostRegister is designed to work on malloc'ed memory and flags the memory pages as locked for the kernel; mmap does not do any mallocs but brings its own flags for that (such as MAP_LOCKED).

My case is indeed a bit different: I know that my reserved memory from 4GB to 192GB is contiguous, and I know that our FPGA card can DMA into this very memory area. I assume that the M2090 can also DMA to and from this area, i.e. that its DMA engine can handle 64-bit addresses. Now how do I tell CUDA to do just that, i.e. to use DMA to transfer from my reserved memory up to the M2090?

I played around with using a cudaMallocHost()'ed staging buffer and copying data from my reserved memory region into that staging buffer. It turned out that memory reads from my mmap'ed reserved physical memory are very slow (50MB/s). I do not yet understand why this is the case, but it is rather a Linux issue than a GPU issue. If this memcpy were fast, I could probably live with the staging buffer approach.
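For clarity, the staging-buffer variant I timed is essentially the following (a sketch; the pointer names are placeholders, and the staging buffer is allocated once and reused for every 10Hz block):

[code]
#include <cuda_runtime.h>
#include <string.h>

/* Copy one 16MB block from the mmap'ed ring buffer to device memory via a
 * pinned staging buffer. 'staging' comes from cudaMallocHost(), 'dev' from
 * cudaMalloc(), 'ringBlock' points into the mmap'ed reserved region. */
static void copy_via_staging(void *dev, const void *ringBlock,
                             void *staging, size_t blockSize)
{
    memcpy(staging, ringBlock, blockSize);                        /* slow: ~50MB/s in my tests  */
    cudaMemcpy(dev, staging, blockSize, cudaMemcpyHostToDevice);  /* fast: ~5.5GB/s pinned copy */
}
[/code]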

Kind regards,
peter

#4
Posted 04/18/2012 04:35 PM   
[quote name='PeterW' date='18 April 2012 - 08:35 AM' timestamp='1334766942' post='1397917']
By further Googling and searching through this forum I have found a related post by tmurray saying that cudaHostRegister cannot be used with mmap'ed memory. This makes some sense if cudaHostRegister is designed to work on malloc'ed memory and flags the memory pages as locked for the kernel; mmap does not do any mallocs but brings its own flags for that (such as MAP_LOCKED).
[/quote]
it works fine with mmap'd memory--that's basically how cudaMallocHost is implemented as of 4.1/r285.

#5
Posted 04/18/2012 05:32 PM   
[quote name='tmurray' date='18 April 2012 - 06:32 PM' timestamp='1334770377' post='1397934']
it works fine with mmap'd memory--that's basically how cudaMallocHost is implemented as of 4.1/r285.
[/quote]
Well, I presume the NVidia kernel module does a kmalloc which is then mmap'ed by the CUDA library. Then there is something in my code that does not do what cudaHostRegister expects.

As mentioned earlier, I reserve memory from 4GB to 192GB at boot time using the Linux mem=4G and memmap=188G$4G command-line parameters. In a dedicated character driver for this reserved memory I implemented the mmap file operation to allow a user space program to map the physical memory into user space virtual memory (the flags I am using for mmap are MAP_FILE | MAP_SHARED).

Using this memory area straight away to copy data to the GPU works, but with poor performance. Applying cudaHostRegister causes an unknown error for small memory blocks (16MB), and an out of memory error for all 188GB.

I guess the difference between cudaMallocHost and my approach is that the memory reserved at boot time is not managed by Linux as such, even though Linux's mmap will produce a virtual memory mapping for the physical memory area. Perhaps this is the reason for the poor performance. Having said that, even NVidia's mmap file operation will have to use

[code]remap_pfn_range(vma, vma->vm_start,
                virt_to_phys((void *)kmalloc_area) >> PAGE_SHIFT,
                size, vma->vm_page_prot)[/code]

which in my case is:

[code]remap_pfn_range(vma, vma->vm_start,
                resmem_hwaddr >> PAGE_SHIFT,
                resmem_length, vma->vm_page_prot)[/code]
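
Spelled out, my driver's mmap file operation is essentially the following (a sketch with error handling trimmed; resmem_hwaddr and resmem_length stand in for my driver's actual module variables, shown here with the 188GB-above-4GB values):

[code]
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

/* physical base and length of the area reserved with mem=4G memmap=188G$4G */
static unsigned long resmem_hwaddr = 0x100000000UL;  /* 4GB  */
static unsigned long resmem_length = 188UL << 30;    /* 188GB */

static int resmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* map the contiguous reserved physical range straight into user space */
    if (remap_pfn_range(vma, vma->vm_start,
                        resmem_hwaddr >> PAGE_SHIFT,
                        size, vma->vm_page_prot))
        return -EAGAIN;

    return 0;
}

static const struct file_operations resmem_fops = {
    .owner = THIS_MODULE,
    .mmap  = resmem_mmap,
};
[/code]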

Bottom line: cudaHostRegister does not work for me with CUDA 4.1 on Linux 2.6.32-40-generic (Ubuntu 10.04.4 LTS) right now, but I remain confident that it can be made to work. The only question is how.

Regards,
peter

#6
Posted 04/18/2012 07:51 PM   
[quote name='PeterW' date='18 April 2012 - 11:51 AM' timestamp='1334778666' post='1397994']
Well, I presume the NVidia kernel module does a kmalloc which is then mmap'ed by the CUDA library. Then there is something in my code that does not do what cudaHostRegister is expecting.[/quote]
as of 4.1/r285, we mmap in the user-mode process and do the same internal operations as cudaHostRegister to implement cudaMallocHost.

at this point, your best bet is to file a bug as a registered developer with as much information as you can provide. I'm guessing we don't support MAP_FILE with cudaHostRegister (all of our mmap calls use MAP_ANONYMOUS) but that's just a guess from someone that doesn't know that part of the codebase particularly well.

#7
Posted 04/18/2012 08:04 PM   
[quote name='tmurray' date='18 April 2012 - 09:04 PM' timestamp='1334779446' post='1398001']
as of 4.1/r285, we mmap in the user-mode process and do the same internal operations as cudaHostRegister to implement cudaMallocHost.

at this point, your best bet is to file a bug as a registered developer with as much information as you can provide. I'm guessing we don't support MAP_FILE with cudaHostRegister (all of our mmap calls use MAP_ANONYMOUS) but that's just a guess from someone that doesn't know that part of the codebase particularly well.
[/quote]

Thanks for the reply. I'll try MAP_ANONYMOUS next instead of MAP_FILE. Perhaps this is the key.
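
Concretely, I read the suggestion as first trying something along these lines (a sketch, using the 16MB block size from my tests; an anonymous mapping gives ordinary pageable RAM rather than my reserved ring buffer, so this only exercises the registration path):

[code]
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t blockSize = 16UL << 20;

    /* anonymous private mapping, as the CUDA runtime itself reportedly uses */
    void *buf = mmap(NULL, blockSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    cudaError_t err = cudaHostRegister(buf, blockSize, cudaHostRegisterPortable);
    printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaHostUnregister(buf);
    munmap(buf, blockSize);
    return 0;
}
[/code]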

I also tried copying from malloc'ed memory (through C++ new) into my mmap'ed reserved physical memory and found very poor performance, about 50MB/s. As a matter of fact, memcpy(mmapedDest, mallocSource, 16*1024*1024UL) performed pretty much the same as a *dest++ = *source++ loop (using unsigned long* and the appropriate number of iterations). This suggests that memcpy does not use page copies but element-by-element copies, which might explain why the NVidia driver cannot use DMA either.
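The comparison I ran is essentially this (a sketch; dst points into the mmap'ed reserved region, src is the malloc'ed buffer, and both variants came out at roughly the same ~50MB/s here):

[code]
#include <stddef.h>
#include <string.h>

/* variant 1: plain memcpy of the block */
static void copy_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* variant 2: explicit 64-bit word copy */
static void copy_words(void *dst, const void *src, size_t bytes)
{
    unsigned long *d = dst;
    const unsigned long *s = src;
    size_t i;

    for (i = 0; i < bytes / sizeof(unsigned long); ++i)
        d[i] = s[i];
}
[/code]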

My gut feeling is that the key lies in the appropriate flags for mmap in user space. Perhaps Nvidia could give some recommendations on how to use mmap in order to make the GPU use DMA. (Even though it is a mystery to me how the driver could possibly derive from a 64-bit quantity, the address, whether it can use it for DMA or not.)

Kind regards,
peter

#8
Posted 04/18/2012 09:44 PM   
[quote name='PeterW' date='18 April 2012 - 02:44 PM' timestamp='1334785456' post='1398041']
Thanks for the reply. I'll try MAP_ANONYMOUS next instead of MAP_FILE. Perhaps this is the key.

I also tried to copy from malloc'ed memory (through c++ new) into my mmap'ed reserved physical memory and found very poor performance, about 50MB/s.
[/quote]

Doesn't that make it obvious that the problem lies somewhere other than CUDA? What OS are you using and what's the (file) cache setup? Are you sure there is no disk thrashing during this copy? What if you copy into the same region multiple times? Is it that slow every time?

#9
Posted 04/19/2012 03:12 AM   
[quote name='RezaRob3' date='19 April 2012 - 04:12 AM' timestamp='1334805122' post='1398139']
Doesn't that make it obvious that the problem lies somewhere other than CUDA? What OS are you using and what's the (file) cache setup? Are you sure there is no disk thrashing during this copy? What if you copy into the same region multiple times? Is it that slow every time?
[/quote]
Indeed it does. As mentioned in my first post, I am using Ubuntu 10.04.4 with Linux 2.6.32-40-generic, but I do not know what the (file) cache setup is. I doubt that there is a relation to the hard disk. I have run the copies in a loop, and it is always equally slow, as if memcpy copied 64-bit quantities in a loop (I know this because I have tried casting the pointer to types from unsigned char to unsigned long, and the time memcpy takes is the same as a loop copy of unsigned longs).

Well, the bottom line is that somehow I do not manage to convey the contiguous nature of my memory area to CUDA or to the Linux kernel. In addition, the poor performance could be the result of the Linux kernel not managing the reserved memory, or of reads/writes from/to that memory not going through the cache, or other things, or a combination. One thing is sure: 50MB/s is slow even without a cache.

Kind regards,
peter

#10
Posted 04/19/2012 08:34 AM   
Hello,

Today I have carried out all sorts of tests, the results of which I would like to share on this forum. I also attach the source code used, which should build easily on a Linux computer.

It appears that the performance depends on the size of the memory reserved at boot time and later mmap'ed into user space. I have carried out memcpy tests of a 16MB data block in three scenarios:

A: memcpy from mmap'ed to malloc'ed (through new operator)
B: cudaMemcpy from mmap'ed to cudaMalloc'ed (device memory), which is what I am mostly concerned about
C: cudaMemcpy from cudaMallocHost'ed to cudaMalloc'ed (device memory)

The resulting transfer times (and speeds in MB/s) for the different sizes of reserved RAM to be mmap'ed are in the following table (two runs; the timing for cases B and C is sketched at the end of this post):

[code]
   |  1GB                   |  4GB                   |  16GB                  |  64GB                  |  128GB                  |  188GB

run 1:
A: |  9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) |  9.326ms (1798.97MB/s) | 213.892ms (  78.43MB/s) | 206.476ms (  81.25MB/s)
B: |  4.494ms (3733.25MB/s) |  4.508ms (3721.65MB/s) |  4.706ms (3565.07MB/s) |  4.513ms (3717.53MB/s) | 111.657ms ( 150.25MB/s) | 112.231ms ( 149.48MB/s)
C: |  2.933ms (5720.16MB/s) |  2.949ms (5689.12MB/s) |  2.946ms (5694.91MB/s) |  2.951ms (5685.27MB/s) |   3.001ms (5590.54MB/s) |   3.022ms (5551.69MB/s)

run 2:
A: |  4.255ms (3942.94MB/s) |  4.249ms (3948.51MB/s) |  4.257ms (3941.09MB/s) |  4.298ms (3903.49MB/s) | 208.269ms (  80.55MB/s) | 200.627ms (  83.62MB/s)
B: |  4.370ms (3839.18MB/s) |  4.411ms (3803.50MB/s) |  4.456ms (3765.08MB/s) |  4.391ms (3820.82MB/s) | 111.562ms ( 150.38MB/s) | 112.090ms ( 149.67MB/s)
C: |  2.937ms (5712.37MB/s) |  2.949ms (5689.12MB/s) |  2.949ms (5689.12MB/s) |  2.954ms (5679.49MB/s) |   2.976ms (5637.51MB/s) |   2.985ms (5620.51MB/s)
[/code]

1. cudaMemcpy from cudaMallocHost'ed to cudaMalloc'ed memory (case C) is of course unaffected by the mmap. The transfer is carried out in under 3ms, or 5.5GB/s, which is quite good.

2. From the first to the second run, memcpy from mmap'ed to malloc'ed memory seems to benefit from the contents already being cached somewhere.

3. There is a significant performance degradation above 64GB, which can be seen both with memcpy and with cudaMemcpy. The latter still achieves about twice the throughput of memcpy.

4. For reserved memory <= 64GB, cudaMemcpy from mmap'ed to cudaMalloc'ed memory takes about 50% longer, which is unexpected because the memory is known to be contiguous and the Nvidia device's DMA engine should be able to exploit that.

I would love to read an explanation for this behavior. Perhaps somebody in the Linux kernel developers group thought: 64GB should be enough for anybody (does this ring a bell?)
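
For completeness, the timing of cases B and C is done along these lines (a sketch of the attached benchmark, not the full source; case B passes a pointer into the mmap'ed region, case C a cudaMallocHost'ed buffer):

[code]
#include <cuda_runtime.h>
#include <stdio.h>

/* time one host-to-device copy of 'bytes' from 'hostSrc' to 'devDst' */
static float time_h2d_copy(void *devDst, const void *hostSrc, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(devDst, hostSrc, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    printf("%.3fms (%.2fMB/s)\n",
           ms, (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
[/code]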

Kind regards,
peter

#11
Posted 04/19/2012 05:10 PM   
Peter, you might get a quicker response if you make your example _tiny_ (I know it's small, but make it tiny and remove ALL CUDA related items) and then ask on the Linux Kernel Mailing List. Be sure to tell them your kernel version and hardware setup (motherboard/CPU/RAM configuration).

#12
Posted 04/19/2012 08:24 PM   
Hello RezaRob, yes, that is what I did last night on Stack Overflow. The first response mentioned cache size, even though that does not sound very plausible to me, as a DMA transfer controlled by the Nvidia driver should be unaffected by the cache. In addition, a memory transfer exploiting the cache would rather depend on the size of the memory transferred, not on the size of the memory reserved.

What is still a puzzle to me is that for reserved RAM < 64GB, where the performance is not too bad, a cudaMemcpy from mmap'ed to cudaMalloc'ed (device) memory takes 50% longer than a cudaMemcpy from cudaMallocHost'ed to cudaMalloc'ed memory. My only explanation would be that the Nvidia driver, as it does not know about the contiguous nature of the mmap'ed memory, has to go through all the pages and either work out a DMA transfer schedule or do DMA on a page-by-page basis. In either case, would there be a way to tell the Nvidia driver that it should not worry and just assume that the mmap'ed area is indeed contiguous?

Kind regards,
peter

#13
Posted 04/20/2012 10:26 AM   
[quote name='PeterW' date='20 April 2012 - 03:26 AM' timestamp='1334917571' post='1398655']
My only explanation would be that the Nvidia driver, as it does not know about the contiguous nature of the mmap'ed memory, has to go through all the pages and either work out a DMA transfer schedule or do DMA on a page-by-page basis. In either case, would there be a way to tell the Nvidia driver that it should not worry and just assume that the mmap'ed area is indeed contiguous?
[/quote]

How do you know that it's contiguous?

#14
Posted 04/20/2012 10:12 PM   
[quote name='RezaRob3' date='20 April 2012 - 11:12 PM' timestamp='1334959963' post='1398887']
How do you know that it's contiguous?
[/quote]
Well, I reserve memory at boot time on the kernel command line as "memmap=64G$128G"; my assumption is that the memory area set aside by the Linux kernel refers to a contiguous block of 64GB of physical memory at offset 128GB. Of course, mmap will present this memory area in virtual address space to user space, and perhaps there the virtual memory pages are not contiguous any more. But I guess this does not matter, as the Nvidia driver will somehow get the physical addresses back and realize that the underlying physical memory pages are indeed contiguous. Maybe it is this process that takes time, and if the Nvidia driver allocates the contiguous memory itself, it does not have to do that at a later stage when it comes to the DMA transfer.

For now I will have to live with the 50% performance hit compared to cudaMallocHost'ed memory, and with the fact that I can only use 64GB of RAM.

Regards,
peter

#15
Posted 04/21/2012 08:53 AM   