Official answer: yes, you can talk about the 2.2 beta. If you have bug reports, please make sure to file a bug on the registered developer site (in addition to any prodding you want to do on the forums).
Anyway…
Zero-copy is somewhat confusing when you first look at it, but it might be the most powerful thing we’ve exposed in CUDA. Zero-copy plus pinned memory shared across contexts (another magical 2.2 feature) is a giant cannon that somebody is going to use for some ridiculous application.
First, the caveat. CUDA is currently limited to a 32-bit address space, and zero-copy is enabled per-context, not per-allocation, so once the appropriate context flag is set, every pinned memory allocation is also a zero-copy allocation (and consumes address space). We’re looking at removing this per-context limitation in the future.
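For reference, turning zero-copy on looks roughly like the sketch below. The runtime calls (cudaSetDeviceFlags with cudaDeviceMapHost, cudaHostAlloc with cudaHostAllocMapped, cudaHostGetDevicePointer) are the real runtime API; the kernel, the CHECK macro, and the helper name run_mapped_alloc_demo are made up for illustration. It also uses cudaDeviceSynchronize, which is the newer spelling of what 2.2 calls cudaThreadSynchronize.

```cuda
#include <cuda_runtime.h>

// Sketch only: bail out on the first CUDA error (early returns skip cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Doubles every element in place. `data` is a device-side alias of pinned host memory.
__global__ void scale_in_place(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns 0 if the kernel's writes show up in host memory.
int run_mapped_alloc_demo(int n)
{
    size_t bytes = (size_t)n * sizeof(float);

    // The context flag: set it before the context is created on this device.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Pinned host allocation that is also mapped into the device address space.
    float *h_data = NULL;
    CHECK(cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // Device pointer that refers to the same physical host memory.
    float *d_alias = NULL;
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    scale_in_place<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // no cudaMemcpy anywhere

    int ok = 1;
    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == 2.0f * (float)i);

    cudaFreeHost(h_data);
    return ok ? 0 : -1;
}
```

Calling run_mapped_alloc_demo(1024) from main should return 0 on a device that can map host memory.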
Let’s split zero-copy discussion into two separate buckets: MCP79, the easy case, and GT200, the more complicated case.
MCP79: Zero-copy here gives you two things: direct access to host memory and actual copy elimination. MCP79 will use any memory on the host directly, so this is really good for low-latency applications. There’s no PCIe traffic or anything like that; sysmem is used directly because MCP79 is the chipset. It’s absolutely ridiculous. The only reason not to use zero-copy on an MCP79 is the 32-bit address space limitation, so in reality, you will pretty much always use zero-copy on MCP79. If you are an audio guy, please write something using zero-copy on MCP79. I’ve really wanted to do this, but I haven’t had time. I expect its perf compared to other things in this segment to be mind-blowing.
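If you want to tell the two cases apart at runtime, something like the following should work. The integrated and canMapHostMemory fields of cudaDeviceProp are real; the helper name zero_copy_kind and its return convention are just an illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper:
//    1  -> integrated GPU (the MCP79 case): mapped memory is the GPU's own sysmem
//    0  -> discrete GPU (the GT200 case): mapped memory is reached across PCIe
//   -1  -> device cannot map host memory, or the query failed
int zero_copy_kind(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    if (!prop.canMapHostMemory) return -1;
    return prop.integrated ? 1 : 0;
}
```

An application could use the result to decide whether to default to zero-copy everywhere (integrated) or only for selected buffers (discrete).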
GT200: The big complicated case.
When you use zero-copy on GT200, the SM will perform a memory fetch across PCIe directly. The accessed data will not touch global memory or anything like that; it goes straight from PCIe into the SM. If you remember TurboCache from the GeForce 6 timeframe, this is a lot like that. DRAM bandwidth and PCIe bandwidth are additive: now you’ve got over 100GB/s of DRAM bandwidth + ~6GB/s of PCIe bandwidth to play with on a GT200.
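To make the "additive" part concrete, here is a sketch of a kernel that pulls one input from device DRAM and the other from mapped host memory in the same pass. The kernel, the CHECK macro, and run_mixed_sources_demo are hypothetical names, and the sketch only checks correctness; it does not measure bandwidth.

```cuda
#include <cuda_runtime.h>

// Sketch only: bail out on the first CUDA error (early returns skip cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// `a` lives in device DRAM. `b` is a device alias of pinned host memory, so its
// loads are serviced from host memory at the moment the kernel reads them.
__global__ void add_mixed(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Hypothetical helper: returns 0 if out[i] == a[i] + b[i] for all i.
int run_mixed_sources_demo(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_b = NULL, *b_alias = NULL, *h_out = NULL, *d_a = NULL, *d_out = NULL;
    CHECK(cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&b_alias, h_b, 0));
    CHECK(cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));

    // h_out doubles as the staging buffer for `a`.
    for (int i = 0; i < n; ++i) { h_out[i] = 1.0f; h_b[i] = 2.0f; }
    CHECK(cudaMemcpy(d_a, h_out, bytes, cudaMemcpyHostToDevice));

    add_mixed<<<(n + 255) / 256, 256>>>(d_a, b_alias, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));

    int ok = 1;
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 3.0f);

    cudaFree(d_a); cudaFree(d_out);
    cudaFreeHost(h_b); cudaFreeHost(h_out);
    return ok ? 0 : -1;
}
```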
However, there’s another side to zero-copy: latency. To answer the OP’s question, you can never totally hide PCIe latency. Even if you’ve got perfect overlap and all of your cudaMemcpyAsyncs are hidden by kernel executions, you still have the initial memcpys to the device before you can start executing (plus the last memcpy you have to do). Zero-copy may be faster for these things; it depends on your access pattern and any number of other variables. Our internal tests have shown that kernel execution time certainly does increase versus accessing everything in DRAM. But you are doing the fetches from the SM, which is a device whose fundamental task is to hide memory latency while doing computation, so you can get really effective latency hiding and some surprising performance advantages. I’ve been trying to get a week free to bang on it and figure out when exactly it’s useful (e.g., I imagine it’s quite useful in some BLAS calls when you’re limited by memory bandwidth to begin with), so I’m very interested in what people discover with it.
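If you want to poke at this yourself, a rough timing harness might look like the sketch below. It times "copy in, run, copy out" against the same kernel running directly on mapped memory. Everything named here (scale, CHECK, time_copy_vs_zero_copy) is hypothetical, the kernel is a trivial streaming one, and the numbers will depend heavily on your hardware and access pattern, so treat it as a starting point rather than a benchmark.

```cuda
#include <cuda_runtime.h>

// Sketch only: bail out on the first CUDA error (early returns skip cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Hypothetical helper: fills *ms_copy and *ms_zero_copy with elapsed milliseconds.
int time_copy_vs_zero_copy(int n, float *ms_copy, float *ms_zero_copy)
{
    size_t bytes = (size_t)n * sizeof(float);
    int blocks = (n + 255) / 256;
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_in, *h_out, *m_in, *m_out, *d_in, *d_out;
    CHECK(cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&m_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void **)&m_out, h_out, 0));
    CHECK(cudaMalloc((void **)&d_in, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Warm-up launch so one-time startup cost is not charged to the first path.
    CHECK(cudaMemset(d_in, 0, bytes));
    scale<<<blocks, 256>>>(d_in, d_out, n);
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t t0, t1;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));

    // Path 1: explicit copies around a kernel that works in device DRAM.
    CHECK(cudaEventRecord(t0, 0));
    CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    scale<<<blocks, 256>>>(d_in, d_out, n);
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(t1, 0));
    CHECK(cudaEventSynchronize(t1));
    CHECK(cudaEventElapsedTime(ms_copy, t0, t1));

    // Path 2: same kernel, reading and writing mapped host memory directly.
    h_out[0] = 0.0f;  // so the spot check below reflects path 2
    CHECK(cudaEventRecord(t0, 0));
    scale<<<blocks, 256>>>(m_in, m_out, n);
    CHECK(cudaEventRecord(t1, 0));
    CHECK(cudaEventSynchronize(t1));
    CHECK(cudaEventElapsedTime(ms_zero_copy, t0, t1));

    int ok = (h_out[0] == 2.0f);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return ok ? 0 : -1;
}
```

Swapping the trivial scale kernel for something with more arithmetic per byte would be the obvious next experiment, since that gives the SM more work to hide the fetch latency behind.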
PS: you can do cudaMemcpyAsync and zero-copy at the same time. They will slow each other down since you’ve only got so much PCIe bandwidth to play with in the first place, but it’s something to keep in mind.
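A sketch of doing both at once: a zero-copy kernel in one stream and a cudaMemcpyAsync in another. The stream and copy calls are the real runtime API; increment, CHECK, and run_overlap_demo are hypothetical names. Whether the two actually overlap depends on the device.

```cuda
#include <cuda_runtime.h>

// Sketch only: bail out on the first CUDA error (early returns skip cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

__global__ void increment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns 0 if both the kernel and the copy produced the right data.
int run_overlap_demo(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_mapped, *mapped_alias, *h_staging, *d_buf;
    CHECK(cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&mapped_alias, h_mapped, 0));
    CHECK(cudaHostAlloc((void **)&h_staging, bytes, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&d_buf, bytes));
    for (int i = 0; i < n; ++i) { h_mapped[i] = 0.0f; h_staging[i] = 5.0f; }

    cudaStream_t s_kernel, s_copy;
    CHECK(cudaStreamCreate(&s_kernel));
    CHECK(cudaStreamCreate(&s_copy));

    // Zero-copy kernel and async copy are issued back to back in separate
    // streams, so both may be using PCIe at the same time.
    increment<<<(n + 255) / 256, 256, 0, s_kernel>>>(mapped_alias, n);
    CHECK(cudaMemcpyAsync(d_buf, h_staging, bytes, cudaMemcpyHostToDevice, s_copy));
    CHECK(cudaStreamSynchronize(s_kernel));
    CHECK(cudaStreamSynchronize(s_copy));

    // Clear the staging buffer, then read d_buf back to verify the async copy.
    for (int i = 0; i < n; ++i) h_staging[i] = 0.0f;
    CHECK(cudaMemcpy(h_staging, d_buf, bytes, cudaMemcpyDeviceToHost));

    int ok = 1;
    for (int i = 0; ok && i < n; ++i) ok = (h_mapped[i] == 1.0f && h_staging[i] == 5.0f);

    cudaStreamDestroy(s_kernel); cudaStreamDestroy(s_copy);
    cudaFree(d_buf);
    cudaFreeHost(h_mapped); cudaFreeHost(h_staging);
    return ok ? 0 : -1;
}
```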
PPS: also keep in mind that there are all sorts of read-after-write hazards associated with zero-copy. If you write to the region on the CPU and expect it to be immediately visible to the GPU, whether that works is probably PCIe controller dependent. Same going in the other direction. The only thing we guarantee is that if you write to a PCIe location in one thread and read it later from that same thread, you’ll see the updated value.
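One conservative way to stay clear of these hazards is to treat kernel launch and event completion as the only hand-off points: the CPU finishes writing before the launch, and does not read until an event recorded behind the kernel has completed. The sketch below shows that pattern. It is an assumption about what is safe to rely on, not a statement of what the hardware guarantees mid-kernel, and negate, CHECK, and run_ordered_access_demo are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Sketch only: bail out on the first CUDA error (early returns skip cleanup).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

__global__ void negate(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = -data[i];
}

// Hypothetical helper: returns 0 if the CPU sees every GPU write after the event.
int run_ordered_access_demo(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_data, *d_alias;
    CHECK(cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    // Step 1: all CPU writes happen before the kernel is launched.
    for (int i = 0; i < n; ++i) h_data[i] = (float)(i + 1);

    // Step 2: launch, then record an event behind the kernel in the same stream.
    cudaEvent_t done;
    CHECK(cudaEventCreate(&done));
    negate<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, 0));

    // Step 3: the CPU does not touch h_data until the event has completed.
    CHECK(cudaEventSynchronize(done));

    int ok = 1;
    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == -(float)(i + 1));

    cudaEventDestroy(done);
    cudaFreeHost(h_data);
    return ok ? 0 : -1;
}
```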