CPU operation is very slow on memory allocated by cudaMallocHost

Copying data between the GPU and CPU is faster when I use cudaMallocHost (rather than malloc) to allocate the host memory (let's call it hostMem).

However, CPU operations on hostMem are much slower. Is there a way to allocate memory that keeps the copies fast but doesn't slow down CPU access?
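For reference, this is roughly what I am doing (a simplified sketch; the size and the CPU loop are just placeholders):

#include <cuda_runtime.h>

int main()
{
    size_t memSize = 1 << 24;                        // placeholder size

    float *hostMem = NULL;
    cudaMallocHost((void**)&hostMem, memSize);       // pinned host memory
    // float *hostMem = (float*)malloc(memSize);     // pageable alternative for comparison

    float *devMem = NULL;
    cudaMalloc((void**)&devMem, memSize);

    // The H2D copy is noticeably faster with the pinned allocation...
    cudaMemcpy(devMem, hostMem, memSize, cudaMemcpyHostToDevice);

    // ...but a plain CPU loop over hostMem is much slower than over malloc'd memory
    for (size_t i = 0; i < memSize / sizeof(float); i++)
        hostMem[i] += 1.0f;

    cudaFree(devMem);
    cudaFreeHost(hostMem);
    return 0;
}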

Thanks in advance.

Hi,

Have you maximized the device performance first?

sudo ./jetson_clocks.sh

Thanks.

Yes, I did.
I found from some other topics that pinned memory (allocated by cudaMallocHost) doesn't use the CPU cache, which is why CPU operations on pinned memory are slow.

Hi,

YES.

It’s recommended to use unified memory on Jetson.
You can check this document on memory management for Jetson:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
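
For example, a minimal managed-memory sketch (the kernel, size, and launch configuration below are just placeholders) would be:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addOne(float *data, size_t n)        // placeholder kernel
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    size_t n = 1 << 20;
    float *buf = NULL;
    cudaMallocManaged(&buf, n * sizeof(float));       // one pointer, visible to CPU and GPU

    for (size_t i = 0; i < n; i++) buf[i] = 0.0f;     // CPU writes directly, no cudaMemcpy

    addOne<<<(n + 255) / 256, 256>>>(buf, n);         // GPU uses the same pointer
    cudaDeviceSynchronize();                          // wait before the CPU touches it again

    printf("%f\n", buf[0]);
    cudaFree(buf);
    return 0;
}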

Thanks.

I am working on unified memory, but some people say it's not supported on TX2 while some documents say it is. Do you have a definite answer on this?

It is supported.

Thanks. I have implemented unified memory but encountered a Bus error (core dumped). I read some articles mentioning that GPU and CPU access to unified memory has to be exclusive.

My program is multi-threaded, and the GPU and CPU need to access different parts of the unified memory. Is this the reason for the Bus error (core dumped)? If so, is there an alternative that lets me use unified memory together with multi-threading?

Unified memory means you can access it from the CPU or the GPU without copying, but accessing it from both at the same time is usually a bad idea. What result would you expect if both sides were writing to it at the same time?
You may rather allocate several buffers with unified memory.
For example, you may receive data from the camera on the CPU side and store it into buffer1, while the GPU is processing buffer2 and buffer3 is being displayed (from the GPU or CPU side).
Once a frame has been processed in each, you would then receive data into buffer3, the GPU would process buffer1 (which holds the previously acquired frame), and the display would read buffer2 (previously processed by the GPU), and so on. This is just a basic example; it may be more complex depending on your use case.
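A rough sketch of that rotation (the kernel and the receive/display helper functions are placeholders for whatever you use, not real APIs):

float *buf[3];
for (int i = 0; i < 3; i++)
    cudaMallocManaged(&buf[i], memSize);              // one managed allocation per stage

int acquire = 0, process = 1, display = 2;
while (running) {
    receiveFrameOnCPU(buf[acquire]);                  // CPU fills its own buffer

    processKernel<<<grid, block>>>(buf[process]);     // GPU works on another buffer
    cudaDeviceSynchronize();                          // finish GPU work before any CPU access

    showFrame(buf[display]);                          // display the third buffer

    // rotate roles: acquired -> process, processed -> display, displayed -> acquire
    int tmp  = display;
    display  = process;
    process  = acquire;
    acquire  = tmp;
}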

Thanks Honey_Patouceul.

  1. My current situation is that I do need to access unified memory from the GPU and the CPU at the same time. I logically partition the unified memory into 10 parts, and the GPU and CPU access it at the same time but on different parts.

  2. Your suggestion “you may rather allocate several buffers with unified memory” sounds workable, but how can I allocate several buffers with unified memory? For example:

void *unified_buffer[10];                              // one pointer per part
for (int i = 0; i < 10; i++) {
    cudaMallocManaged(&unified_buffer[i], memSize);    // one managed allocation per part
}

From my understanding, even though there are 10 starting pointers in the code above, they are all regarded as one unified memory. So while the GPU is accessing unified_buffer[2], can the CPU access unified_buffer[1]?
(The code above is my current implementation, and I did get a Bus error (core dumped).)

Could you shed some light on how I can use several buffers with unified memory, so the GPU could work on unified_buffer[2] while the CPU works on unified_buffer[1]?

Thank you very much.

In my case, the CPU and GPU need to access unified memory in a way that is not supported by the TX2 hardware. Do you have other methods that could help?

Hi, heyworld

Both CPU and GPU can access unified memory.
You can find some information in this document:

Could you share more detail about your use case, so that we can give you a further suggestion?

Thanks.

Hi AastaLLL,

Sorry, what I mean is that in my case the CPU and GPU need to access unified memory at the same time. I use multi-threading, and the CPU and GPU access the unified memory simultaneously but at different addresses.

Hi,

Concurrent access is not supported on TX2 but available on Xavier:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

Simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. However, devices of compute capability 6.x on supporting operating systems allow the CPUs and GPUs to access Unified Memory allocations simultaneously via the new page faulting mechanism. A program can query whether a device supports concurrent access to managed memory by checking a new concurrentManagedAccess property. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.
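
You can check this at runtime, for example with something like:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);   // 0 on TX2, 1 on Xavier
    return 0;
}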

Thanks.