Reading a memory-mapped pointer from a 3rd-party PCIe device via cudaHostRegisterIoMemory without CPU caching?

Hey experts!

I am currently memory-mapping a Netlist EV3 card for GPU access via:

cudaHostRegister( ptr, size, cudaHostRegisterIoMemory );

Currently, if I pass a device pointer from GPU A to a kernel running on GPU B, CUDA initiates an automatic P2P DMA transfer from GPU A to GPU B without the data being cached on the CPU. I want the same behaviour with the Netlist EV3 card: I pass the memory-mapped pointer from the EV3 card to the GPU kernel, and when the GPU reads that address, the data gets transferred directly to the GPU without being cached on the CPU.
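
For context, a stripped-down sketch of how I'm using the mapping (the names are made up; bar is the CPU virtual address of the card's mapped region and bar_size its length):

#include <cuda_runtime.h>

__global__ void read_bar(const volatile unsigned int *io, unsigned int *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = io[i];                          // each read goes out over PCIe to the mapping
}

void run(void *bar, size_t bar_size)             // bar = CPU pointer to the card's mapped region
{
    cudaHostRegister(bar, bar_size, cudaHostRegisterIoMemory);

    void *d_bar = NULL;
    cudaHostGetDevicePointer(&d_bar, bar, 0);    // same mapping, usable from device code

    size_t n = bar_size / sizeof(unsigned int);
    unsigned int *d_out = NULL;
    cudaMalloc(&d_out, n * sizeof(unsigned int));

    read_bar<<<(unsigned int)((n + 255) / 256), 256>>>((const volatile unsigned int *)d_bar, d_out, n);
    cudaDeviceSynchronize();
}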

I did some benchmarks reading through the mapped pointer from the Netlist EV3 card and it runs at ~12 GB/s, but the card sits in a PCIe 3.0 x4 slot, so the data must be getting cached on the CPU.
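
Rough math behind that suspicion, using the standard PCIe 3.0 figures: 8 GT/s per lane × 4 lanes × 128b/130b encoding ≈ 3.9 GB/s of raw link bandwidth, and somewhat less than that usable after protocol overhead. Sustained reads at ~12 GB/s are roughly 3x what an x4 Gen3 link can carry, so the data has to be coming out of CPU system memory rather than crossing the link on every access.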

Is there a way to bypass this caching?

Thanks :)

GPUDirect RDMA

GPUDirect RDMA :: CUDA Toolkit Documentation

assuming you can write a driver for the 3rd party card.
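
Very roughly, the user-space half would look like the sketch below (the ioctl interface to the card's driver is hypothetical). The kernel half of that driver pins the GPU virtual address range with nvidia_p2p_get_pages() from nv-p2p.h, programs the card's DMA engine with the physical addresses it gets back, and releases them with nvidia_p2p_put_pages().

#include <cuda.h>
#include <stdint.h>

int main(void)
{
    const size_t buf_size = 1 << 20;              /* placeholder size */

    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr d_buf;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_buf, buf_size);                 /* ordinary GPU memory */

    /* Recommended by the GPUDirect RDMA guide: make memory operations on this
       allocation synchronous so DMA by a third-party device sees consistent data. */
    unsigned int one = 1;
    cuPointerSetAttribute(&one, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, d_buf);

    /* Hand (d_buf, buf_size) to the card's driver, e.g. through a custom ioctl
       (hypothetical interface); the driver then does the nvidia_p2p_get_pages()
       pinning and sets up the DMA descriptors. */
    /* ioctl(card_fd, EV3_PIN_GPU_BUFFER, &pin_request); */

    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
    return 0;
}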

Hmm, I kind of wanted to avoid doing that. I was thinking… at some point the driver has to detect whether the pointer belongs to another GPU in order to know whether it can DMA the data from that GPU to itself, right?

Therefore, is it possible to “fake” the pointer’s attributes, giving it all the same properties as a GPU pointer, so that when the GPU goes to read it, it DMA transfers from the Netlist card to itself the same way it would from another GPU?
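
By the pointer’s “attributes” I mean roughly what cudaPointerGetAttributes reports (sketch below, field names as in recent toolkits): a cudaMalloc’d pointer shows up as device memory, while the registered EV3 mapping presumably shows up as host memory.

#include <cstdio>
#include <cuda_runtime.h>

/* Print what CUDA itself has recorded about a pointer. */
static void describe(const void *p)
{
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        std::printf("cudaPointerGetAttributes: %s\n", cudaGetErrorString(err));
        return;
    }
    std::printf("type=%d device=%d devicePointer=%p hostPointer=%p\n",
                (int)attr.type, attr.device, attr.devicePointer, attr.hostPointer);
}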

Cheers

Doing what you describe typically requires modifications to the 3rd-party device's driver. If you're on Linux and this is a character or block device, there is likely a layer of system memory caching, done either by the OS or by the EV3's driver, that is leading to the staging through CPU system memory.
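
If the pointer you register comes from that driver's read()/mmap() path, one experiment (assuming Linux, root access, and that the card exposes the region you care about as a BAR; the PCI address 0000:03:00.0 and the size below are placeholders) is to map the BAR directly through sysfs. The resourceN files map the BAR uncached (resourceN_wc is write-combined), with no page cache or driver staging buffer in between, and you can then register that mapping the way you already do:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bar_size = 1 << 20;   /* placeholder; take the real size from lspci -v */

    /* resource0 exposes BAR0 of the device; the mapping bypasses the page cache. */
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0", O_RDWR | O_SYNC);
    void *bar = mmap(NULL, bar_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Same registration as before, but now on the raw BAR mapping. */
    cudaHostRegister(bar, bar_size, cudaHostRegisterIoMemory);

    /* ... get the device pointer and read from it in kernels as before ... */

    cudaHostUnregister(bar);
    munmap(bar, bar_size);
    close(fd);
    return 0;
}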