Two concurrent HtoD copies in Titan X (Pascal) with 2 copy engines

I observe two concurrent HtoD copies in my project, as shown in [url]https://i.ibb.co/g4dGZVV/image.png[/url] (screenshot from NVIDIA Visual Profiler).

As far as I know, Titan X (Pascal) has two copy engines for HtoD and DtoH memory copies, one for each direction, and two concurrent memory copies over PCIe in the same direction are not possible due to PCIe limitations. So how is the profiling result above possible?
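For reference, the number of copy engines the runtime reports can be checked through the `asyncEngineCount` field of `cudaDeviceProp`. The following is only a minimal sketch, assuming device 0 is the Titan X; `describe_engines` is a hypothetical helper added here for readability and is not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): maps asyncEngineCount to a description.
static const char* describe_engines(int count) {
    if (count <= 0) return "copies cannot overlap kernel execution";
    if (count == 1) return "copies in one direction can overlap kernel execution";
    return "HtoD and DtoH copies can overlap each other and kernel execution";
}

int main() {
    const int dev = 0;  // assumption: the GPU of interest is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    printf("%s: asyncEngineCount = %d (%s)\n",
           prop.name, prop.asyncEngineCount,
           describe_engines(prop.asyncEngineCount));
    return EXIT_SUCCESS;
}
```

This only reports what the runtime advertises. It does not say which mechanism the driver actually uses for any particular copy.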

I learned that the copy engine is (probably) not involved when the data transfer is smaller than 64 KB ([url]https://devtalk.nvidia.com/default/topic/1027316/cuda-programming-and-performance/titan-v-announced-15-0-tflops-fp32-5120-cores-12-gb-hbm2-vram-3000-us-price/post/5226469/#5226469[/url]). Does anyone know what the underlying mechanism is?

The data for small HtoD copies can be sent as part of the command stream, which reduces latency. I think of it as in-band transport (data is sent along with the copy command) vs out-of-band transport (copy command kicks off a copy engine transfer). In analogy to processor instructions, one might also think of this as “immediate” data that forms part of a processor instruction.
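One way to probe this is to time HtoD copies from pinned memory at sizes on both sides of the 64 KB figure mentioned above. The sketch below is an assumption-laden experiment, not a description of driver internals: the 64 KB threshold comes from the linked post rather than from documentation, the size range and repetition count are arbitrary, and `avg_us` is a hypothetical helper. The timings may or may not show a visible step at the threshold.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical helper: average time per copy in microseconds.
static double avg_us(float total_ms, int reps) {
    return total_ms * 1000.0 / reps;
}

int main() {
    const size_t max_bytes = 1 << 20;  // 1 MiB, arbitrary upper bound
    const int reps = 100;              // arbitrary repetition count

    void *h_buf = nullptr, *d_buf = nullptr;
    CHECK(cudaMallocHost(&h_buf, max_bytes));  // pinned host memory
    CHECK(cudaMalloc(&d_buf, max_bytes));
    memset(h_buf, 0, max_bytes);

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Sweep sizes from 4 KiB to 1 MiB, crossing the reported 64 KB threshold.
    for (size_t bytes = 4096; bytes <= max_bytes; bytes *= 2) {
        CHECK(cudaEventRecord(start, stream));
        for (int i = 0; i < reps; ++i) {
            CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes,
                                  cudaMemcpyHostToDevice, stream));
        }
        CHECK(cudaEventRecord(stop, stream));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%8zu bytes: %.2f us per copy\n", bytes, avg_us(ms, reps));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return EXIT_SUCCESS;
}
```

Running the same sweep under the profiler, with copies issued from two streams, would show whether the overlapping HtoD copies only appear for the small sizes.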