Our application does video processing using CUDA.
I am trying to use the h264_cuvid codec for decoding. When I receive a decoded frame, I call cuMemcpyAsync to initiate a device-to-device transfer. The source memory was allocated by the h264_cuvid codec using FFmpeg's internal CUDA context; the destination memory was allocated using a CUDA context I created in my application.
The data seems to be transferred through the host instead of directly from device to device. I have attached the Nsight timeline report:
[url]http://imgur.com/a/qce9k[/url]
You can see that the 900 KB of memory was transferred to the host using Context 3 and then transferred to the device using Context 2.
It seems that CUDA provides cuMemcpyPeerAsync to copy memory between different contexts; however, I can't find a way to get the internal cuvid context that was used to allocate the memory.
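For reference, this is roughly the call I would like to make. It is only a sketch using the CUDA driver API: the helper name `copy_frame_peer` and all parameter names are hypothetical, and `src_ctx` is exactly the handle I do not know how to obtain from the decoder.

```c
#include <cuda.h>
#include <stddef.h>

/* Hypothetical helper: copy one decoded frame between two CUDA contexts.
 * src_ctx would be the context that owns `src` (the decoder's internal
 * context), which is the piece I am missing. */
static CUresult copy_frame_peer(CUdeviceptr dst, CUcontext dst_ctx,
                                CUdeviceptr src, CUcontext src_ctx,
                                size_t bytes, CUstream stream)
{
    /* Asynchronous with respect to the host; ordered on `stream`. */
    return cuMemcpyPeerAsync(dst, dst_ctx, src, src_ctx, bytes, stream);
}
```

I have not been able to try this, so I cannot say whether it actually removes the host staging seen in the timeline.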
How can I avoid this host transfer?
Thanks.