I am running samples/1_Utilities/bandwidthTest. I see great performance for DeviceToHost and HostToDevice operations.
However, if instead of RAM I use the MMIO space of some other device, the performance drops dramatically. I also observe that instead of using DMA in this case, the CPU is used (which is what actually makes the performance so bad…). Is it possible to force CUDA to use DMA instead of the CPU for that copy?
The basic OS driver model generally prevents PCI device A from writing directly to a buffer owned by PCI device B without special support in the drivers.
If you want to transfer data directly to/from a PCI device that is on the same PCI fabric as a GPU, the supported method for that is GPUDirect RDMA.
This assumes you have access to the driver source code for your device and are a reasonably proficient driver writer for the OS in question.
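To give a feel for what "special things in the drivers" means, here is a heavily abbreviated sketch of the GPUDirect RDMA calls a third-party Linux kernel driver would make, using the `nv-p2p.h` interface that ships with the NVIDIA driver. This is a non-runnable illustration, not a complete driver: names like `pin_gpu_buffer` are hypothetical, error handling is minimal, and the GPU virtual address would in practice arrive from user space (e.g. via an ioctl) after being allocated with `cudaMalloc`.

```c
/* Sketch of the GPUDirect RDMA driver-side flow (Linux kernel module).
 * Assumes the NVIDIA driver's nv-p2p.h header is available. */
#include <linux/module.h>
#include <nv-p2p.h>

static struct nvidia_p2p_page_table *page_table;

/* Invoked by the NVIDIA driver if the GPU mapping must be revoked
 * (e.g. the CUDA process exits while pages are still pinned). */
static void free_callback(void *data)
{
    nvidia_p2p_free_page_table(page_table);
}

/* Pin a range of GPU device memory so this device's DMA engine can
 * read/write it directly over the PCIe fabric, bypassing system RAM. */
static int pin_gpu_buffer(u64 gpu_va, u64 size)
{
    int ret = nvidia_p2p_get_pages(0, 0, gpu_va, size,
                                   &page_table, free_callback, NULL);
    if (ret)
        return ret;

    /* page_table->pages[i]->physical_address now holds bus addresses
     * that can be programmed into this device's DMA engine. */
    return 0;
}
```

The key point is that the pinning and address translation happen in your device's kernel driver, which is why access to that driver's source code is a prerequisite.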
Unless you’ve done that, CUDA cannot write directly to your device; it will instead write to system memory, and if that memory is not pinned, maximum transfer speed cannot be achieved.
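To illustrate the pinned-versus-pageable distinction that bandwidthTest itself exercises (its `--memory=pinned` option), here is a minimal sketch of the two kinds of host allocation. It requires the CUDA toolkit and a GPU to actually run; buffer sizes and the lack of error checking are simplifications.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 64 << 20;  // 64 MiB test buffer

    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory: the driver must stage the transfer
    // through an internal pinned buffer, involving the CPU.
    void *pageable = std::malloc(bytes);
    cudaMemcpy(pageable, d_buf, bytes, cudaMemcpyDeviceToHost);

    // Page-locked (pinned) host memory: the GPU's DMA engine can
    // transfer directly, which is where the high bandwidth numbers
    // in bandwidthTest come from.
    void *pinned;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);
    cudaMemcpy(pinned, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFreeHost(pinned);
    std::free(pageable);
    cudaFree(d_buf);
    return 0;
}
```

Neither of these paths writes to another device's MMIO space, though; that still requires the GPUDirect RDMA route described above.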