CUDA Device-to-Device Copy with Host-Side Synchronization

Hi, I am doing a device-to-device cudaMemcpy. My code creates two host threads:
Thread 1 - Copies data from one device memory buffer to another device memory buffer.
Thread 2 - Operates on the copied data.

In the CPU program, how can I know that Thread 1 has completed the memcpy before I tell Thread 2 to start processing, so that it works on the latest data and not on stale or junk data left in the buffer?

According to CUDA Driver API :: CUDA Toolkit Documentation, “For transfers from device memory to device memory, no host-side synchronization is performed.” Could you please help me understand how to handle this situation? If you could point me to any reference code, that would be helpful.

Thanks,
Tushar
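
For reference, here is a minimal sketch of one possible way to handle this with the CUDA runtime API; it is not taken from the original code. The idea is that Thread 1 issues the copy with cudaMemcpyAsync on its own stream and records an event right after it, and Thread 2 calls cudaEventSynchronize on that event before touching the destination buffer. Because the event must be recorded before it is waited on, a std::promise is used as a plain host-side signal between the two threads. The names copyThenProcess, addOne, copyStream, workStream and copyDone are made up for illustration, and the sketch assumes both threads use the same device.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: abort with a message if a CUDA call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Stand-in for whatever processing Thread 2 does on the copied data.
__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if Thread 2 processed the freshly copied data.
bool copyThenProcess() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n, 41);

    int* dSrc = nullptr;
    int* dDst = nullptr;
    check(cudaMalloc(&dSrc, bytes), "cudaMalloc dSrc");
    check(cudaMalloc(&dDst, bytes), "cudaMalloc dDst");
    check(cudaMemcpy(dSrc, host.data(), bytes, cudaMemcpyHostToDevice), "H2D copy");
    check(cudaMemset(dDst, 0, bytes), "cudaMemset");  // simulate stale contents
    check(cudaDeviceSynchronize(), "setup sync");

    cudaStream_t copyStream, workStream;
    cudaEvent_t copyDone;
    check(cudaStreamCreate(&copyStream), "create copyStream");
    check(cudaStreamCreate(&workStream), "create workStream");
    check(cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming), "create event");

    // Host-side signal: tells Thread 2 that the event has been recorded.
    std::promise<void> recorded;
    std::future<void> recordedFuture = recorded.get_future();

    // Thread 1: enqueue the device-to-device copy, then record an event after it.
    std::thread copier([&] {
        check(cudaMemcpyAsync(dDst, dSrc, bytes, cudaMemcpyDeviceToDevice, copyStream),
              "D2D copy");
        check(cudaEventRecord(copyDone, copyStream), "record event");
        recorded.set_value();
    });

    // Thread 2: wait for the record, then block the host until the copy has finished.
    std::thread worker([&] {
        recordedFuture.wait();
        check(cudaEventSynchronize(copyDone), "event sync");
        addOne<<<(n + 255) / 256, 256, 0, workStream>>>(dDst, n);
        check(cudaGetLastError(), "kernel launch");
        check(cudaStreamSynchronize(workStream), "work sync");
    });

    copier.join();
    worker.join();

    check(cudaMemcpy(host.data(), dDst, bytes, cudaMemcpyDeviceToHost), "D2H copy");
    bool ok = true;
    for (int v : host) ok = ok && (v == 42);  // 41 copied, then incremented once

    cudaEventDestroy(copyDone);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(workStream);
    cudaFree(dSrc);
    cudaFree(dDst);
    return ok;
}
```

Calling copyThenProcess() from main runs the whole sequence. If Thread 2 does not need to block on the CPU, cudaStreamWaitEvent(workStream, copyDone, 0) could replace cudaEventSynchronize so that only the GPU-side work waits for the copy. A simpler alternative is to have Thread 1 call cudaStreamSynchronize(copyStream) after the copy and only then signal Thread 2.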