glReadPixels to PBO - avoiding implicit glFinish

I am trying to read back framebuffer color data to CPU memory with as little interruption to the framerate as possible, using OpenGL 3.2 on a discrete desktop card (GTX 760, Win10 x64, driver 385.41).

I have read a number of sources suggesting that the fast path is to create a GL_PIXEL_PACK_BUFFER object (PBO) and bind it before calling glReadPixels. This is supposed to start an asynchronous transfer, so you can do other work in the meantime before calling glMapBuffer. For now I am only doing this on a key press (as if taking a screenshot), so I do not have ping-pong FBOs set up. I call glReadPixels, create a fence, wait for the fence, and then call glMapBuffer.
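For reference, the sequence I am describing looks roughly like the sketch below. It is a minimal outline rather than my exact code: the helper names `create_pack_pbo` and `read_framebuffer_blocking` are made up for illustration, GLEW is assumed as the loader (any loader exposing the GL 3.2 entry points would do), and GL_BGRA / GL_UNSIGNED_BYTE at 4 bytes per pixel is an assumed readback format. A current GL context is required.

```c
#include <stddef.h>
#include <string.h>
#include <GL/glew.h>   /* assumption: GLEW; any GL 3.2 loader works */

/* Hypothetical helper: allocate a PBO big enough for one RGBA8 frame. */
static GLuint create_pack_pbo(int width, int height)
{
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER,
                 (GLsizeiptr)width * height * 4, NULL, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return pbo;
}

/* Hypothetical helper: glReadPixels -> fence -> wait -> glMapBuffer.
   Copies the pixels into dst and returns 1 on success, 0 on failure. */
static int read_framebuffer_blocking(GLuint pbo, int width, int height,
                                     void *dst)
{
    const size_t size = (size_t)width * (size_t)height * 4;
    int ok = 0;

    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    /* With a pack buffer bound, the last argument is a byte offset
       into the PBO, not a client memory pointer. */
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, (void *)0);

    /* The fence is signaled once the copy into the PBO has completed. */
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    GLenum status;
    do {
        /* Flush so the fence is actually submitted; wait 1 ms per try. */
        status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000);
    } while (status == GL_TIMEOUT_EXPIRED);
    glDeleteSync(fence);

    if (status != GL_WAIT_FAILED) {
        const void *src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (src != NULL) {
            memcpy(dst, src, size);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
            ok = 1;
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return ok;
}
```

The PBO would be created once at startup, and `read_framebuffer_blocking` called from the key-press handler with a destination buffer of at least width * height * 4 bytes.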

Now, it seems like the transfer that glReadPixels starts is indeed asynchronous: if I create a fence after calling glReadPixels, it takes a reasonably small amount of time to be signaled (3-4 ms on a 1280x720 framebuffer, which is ~1 GB/s), and getting the data out of the PBO works correctly. However, glReadPixels itself seems to perform an implicit glFinish and wait for pending GPU work to complete, because the call takes 16+ ms! If I call glFinish() first and then glReadPixels(), glFinish() takes 16+ ms and glReadPixels() is almost instant, so it does indeed seem to be waiting on GPU work.
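The ~1 GB/s figure comes from simple arithmetic, sketched below. The function name is made up for illustration, and 4 bytes per pixel (an RGBA8-style format) is an assumption about the readback format.

```c
/* Hypothetical helper: effective readback rate in bytes per second for a
   width x height frame that took `milliseconds` to arrive. */
static double readback_bytes_per_second(int width, int height,
                                        int bytes_per_pixel,
                                        double milliseconds)
{
    const double bytes = (double)width * height * bytes_per_pixel;
    return bytes / (milliseconds / 1000.0);
}
```

A 1280x720 frame at 4 bytes per pixel is 3,686,400 bytes, so `readback_bytes_per_second(1280, 720, 4, 3.0)` gives about 1.23e9 and `readback_bytes_per_second(1280, 720, 4, 4.0)` gives about 0.92e9, i.e. roughly 1 GB/s either way.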

My question is: how can I queue a copy from the GPU framebuffer to CPU memory without the CPU stalling on the GPU just to initiate the transfer? I recognize that the CPU has to wait until the copy completes before it can access the data, but I don’t understand why the CPU has to wait for the GPU to finish its current rendering before it can even initiate the transfer.

What I am imagining I want to do at the hardware level is to queue a GPU command to “copy framebuffer to on-GPU PBO memory” after my drawing commands for that frame, then use a fence to wait until that command completes.
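In code, what I have in mind is something like the sketch below: issue the readback right after the frame's draw calls, then poll the fence with a zero timeout on later frames instead of blocking. The `Readback` struct and function names are hypothetical, GLEW is again assumed as the loader, and the PBO is assumed to be already allocated for a width * height * 4 byte frame. This only shows the idiom; it does not by itself explain or remove the stall I am seeing inside glReadPixels.

```c
#include <stddef.h>
#include <GL/glew.h>   /* assumption: GLEW; any GL 3.2 loader works */

/* Hypothetical bookkeeping for one in-flight readback. */
typedef struct {
    GLuint pbo;     /* pre-allocated GL_PIXEL_PACK_BUFFER object */
    GLsync fence;   /* NULL when no readback is in flight */
} Readback;

/* Call after the frame's draw commands: queue the copy into the PBO and
   a fence behind it, then return without waiting. */
static void readback_begin(Readback *rb, int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, rb->pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, (void *)0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    rb->fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();  /* make sure the fence is submitted so polling can progress */
}

/* Call once per frame. Returns 1 when the PBO contents are ready to map,
   0 if the copy is still pending, failed, or nothing is in flight. */
static int readback_poll(Readback *rb)
{
    if (rb->fence == NULL)
        return 0;

    /* Zero timeout: just query the fence state, never block. */
    GLenum status = glClientWaitSync(rb->fence, 0, 0);
    if (status == GL_TIMEOUT_EXPIRED)
        return 0;

    glDeleteSync(rb->fence);
    rb->fence = NULL;
    return status != GL_WAIT_FAILED;
}
```

Once `readback_poll` returns 1, the PBO can be bound and mapped with glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY) without waiting on the GPU.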

Is there a different API or idiom I should be using here? It seems like something is a little off to me.

EDIT: I should note I also get this driver diagnostic message from glReadPixels:
“Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.”

I am also noticing that the time spent in glReadPixels is significantly reduced to < 1 msec if the app runs either fullscreen or in “fullscreen borderless windowed” mode as opposed to windowed mode (I am using SDL2 to manage initializing the GL context.) I don’t know if that’s a timing quirk, but it seems very repeatable – possibly some sort of interference with desktop compositing?