Decoding using FFmpeg + CUDA post-processing

Our application does video processing using CUDA.

I am trying to use the h264_cuvid codec for decoding. When I receive a decoded frame, I call cuMemcpyAsync to start a device-to-device transfer. The source memory was allocated by the h264_cuvid codec using FFmpeg's internal CUDA context; the destination memory was allocated using a CUDA context I created in my application.
The data seems to be transferred through the host instead of directly from device to device. I have attached the Nsight timeline report.

[url]http://imgur.com/a/qce9k[/url]
You can see that the 900 KB of memory was transferred to the host using Context 3 and then transferred to the device using Context 2.

It seems that CUDA provides cuMemcpyPeerAsync to copy memory between different contexts; however, I can't find a way to get the internal cuvid context that was used to allocate the memory.

How can I avoid this host transfer?
Thanks.

I have tested cuMemcpyPeerAsync, and the transfer still goes through an intermediate CPU buffer.
Is it not possible to copy data directly between two contexts?

Thanks.