Implicit synchronization

Hi everyone,

I was not getting the expected concurrency when using multiple streams, and realized the issue comes from a restriction detailed in section 3.2.5.5.4, Implicit Synchronization, of the CUDA C Programming Guide:

"Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

- a page-locked host memory allocation,
- a device memory allocation,
- a device memory set,
- a memory copy between two addresses to the same device memory,
- any CUDA command to the NULL stream,
- a switch between the L1/shared memory configurations described in Compute Capability 2.x and Compute Capability 3.x."

The issue is that I allocate pinned memory and device memory between two sets of operations in different streams. The solution is quite simple: I just need to allocate the memory beforehand. But that requires knowing in advance exactly how much memory you need, or having some sort of complex memory-management mechanism.
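
For reference, here is a minimal sketch of the "allocate beforehand" workaround, assuming a simple two-stream pipeline (the kernel, sizes, and layout are illustrative, not from my actual code; error checking omitted):

```
#include <cuda_runtime.h>

__global__ void work(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h[2], *d[2];
    cudaStream_t s[2];
    // All pinned-host and device allocations happen up front,
    // so no implicit synchronization point falls between streams.
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void**)&h[i], bytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    // The streamed section issues no allocations, so the copies
    // and kernels from the two streams are free to overlap.
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        work<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```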

My question, therefore, is: is this restriction likely to be removed in a future release?
If so, would anyone else like to see functionality where memory allocations (and deallocations) could be streamed just like transfers and kernels? That way, memory could be allocated only where and when it is needed.

It’s not likely to be removed in a future release. Memory allocations modify the GPU virtual memory map, and updating the GPU virtual memory map must be done while no kernels are running, hence the need for a device sync.

I actually agree with Dude1205 that this behavior is annoying and it would be great if NVIDIA fixed this. You don’t expect std::malloc/std::free on a CPU to block on other threads.

Note that you can work around this limitation pretty easily by implementing your own malloc/free. cudaMalloc/cudaFree perform heavy-weight synchronization and update the GPU virtual memory map, so you don’t have to try very hard to write a faster malloc/free. Just rounding allocations up to a pool block size, sticking them in a std::map, and splitting/merging them on malloc/free calls is significantly faster and avoids the synchronization. A sketch of the idea follows.
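
Here is a minimal sketch of that scheme, assuming one large up-front cudaMalloc and a first-fit free list kept in a std::map (the class name, 256-byte alignment, and first-fit policy are my own choices for illustration; error checking and thread safety are omitted):

```
#include <cuda_runtime.h>
#include <cstddef>
#include <iterator>
#include <map>

// One big cudaMalloc up front; sub-ranges are handed out and recycled
// without touching the CUDA runtime again, so alloc/free cause no
// implicit synchronization.
class PoolAllocator {
public:
    explicit PoolAllocator(size_t poolBytes) : poolBytes_(poolBytes) {
        cudaMalloc(&base_, poolBytes_);   // the only real device allocation
        freeBlocks_[0] = poolBytes_;      // one free block covering the pool
    }
    ~PoolAllocator() { cudaFree(base_); }

    void* alloc(size_t bytes) {
        bytes = roundUp(bytes, 256);      // keep 256-byte alignment
        for (auto it = freeBlocks_.begin(); it != freeBlocks_.end(); ++it) {
            if (it->second >= bytes) {    // first fit
                size_t off = it->first, sz = it->second;
                freeBlocks_.erase(it);
                if (sz > bytes)           // split off the remainder
                    freeBlocks_[off + bytes] = sz - bytes;
                used_[off] = bytes;
                return static_cast<char*>(base_) + off;
            }
        }
        return nullptr;                   // pool exhausted
    }

    void free(void* p) {
        size_t off = static_cast<char*>(p) - static_cast<char*>(base_);
        auto it = used_.find(off);
        if (it == used_.end()) return;
        size_t sz = it->second;
        used_.erase(it);
        auto ins = freeBlocks_.emplace(off, sz).first;
        auto next = std::next(ins);       // merge with the next block if adjacent
        if (next != freeBlocks_.end() && ins->first + ins->second == next->first) {
            ins->second += next->second;
            freeBlocks_.erase(next);
        }
        if (ins != freeBlocks_.begin()) { // merge with the previous block if adjacent
            auto prev = std::prev(ins);
            if (prev->first + prev->second == ins->first) {
                prev->second += ins->second;
                freeBlocks_.erase(ins);
            }
        }
    }

private:
    static size_t roundUp(size_t n, size_t a) { return (n + a - 1) / a * a; }
    void* base_ = nullptr;
    size_t poolBytes_;
    std::map<size_t, size_t> freeBlocks_; // offset -> size of free blocks
    std::map<size_t, size_t> used_;       // offset -> size of live allocations
};
```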

The downside of doing this is that 1) applications will often use more GPU physical memory on average, because free won’t immediately return memory to the driver, 2) the memory-checking tools will have a harder time detecting out-of-bounds accesses, and 3) since kernels execute asynchronously, you need to make sure that calls to free are scheduled after the kernels using the memory have completed (a straightforward solution is to queue the free in the same stream, as in the sketch below).
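
For point 3, one way to queue a free on a stream is a host callback. Here is a sketch reusing the hypothetical PoolAllocator from the sketch above (in real code the pool’s maps would also need a mutex, since the callback runs on a CUDA-internal thread):

```
#include <cuda_runtime.h>

// PoolAllocator is the illustrative class from the earlier sketch.
struct DeferredFree {
    PoolAllocator* pool;
    void* ptr;
};

// Runs only after all previously enqueued work in the stream has completed.
// Stream callbacks must not call into the CUDA API, but pool->free() only
// touches host-side bookkeeping, so it is safe here.
static void CUDART_CB freeCallback(cudaStream_t, cudaError_t, void* userData) {
    DeferredFree* d = static_cast<DeferredFree*>(userData);
    d->pool->free(d->ptr);
    delete d;
}

void streamFree(PoolAllocator& pool, void* ptr, cudaStream_t stream) {
    cudaStreamAddCallback(stream, freeCallback, new DeferredFree{&pool, ptr}, 0);
}
```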

Does “device memory set” include the cudaMemset*Async() functions? cudaMemset() uses the NULL stream, so I can see why that would force implicit synchronization, but why should cudaMemsetAsync() on a non-NULL stream force implicit synchronization?

@Gregory Diamos: Yes. It does not make sense to me that the GPU has to be completely idle in order to allocate new memory. I understand it might be simpler to implement virtual memory maps with this restriction. But in theory, lifting this restriction should be feasible.

Isn’t it already the case for device-side malloc anyway?

I think by definition device-side malloc does not require the GPU to be idle. It allocates out of a reserved, fixed-size heap in global memory, so you just get a virtual address that has already been mapped. It’s interesting to note that you could perform memory allocation asynchronously by wrapping device-side malloc/free in kernels (see the sketch below).
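
Here is a rough sketch of what I mean, assuming compute capability 2.0+ (the kernel names and the heap size are made up for illustration; error checking omitted):

```
#include <cuda_runtime.h>

// Tiny wrapper kernels: because they are ordinary kernel launches, the
// allocation and free are enqueued on a stream like any other work.
__global__ void allocKernel(void** out, size_t bytes) {
    *out = malloc(bytes);   // device-side malloc from the device heap
}

__global__ void freeKernel(void** ptr) {
    free(*ptr);             // device-side free
}

int main() {
    // The device heap is fixed-size; enlarge it before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);

    void** devPtrSlot;      // one up-front slot to hold the returned pointer
    cudaMalloc((void**)&devPtrSlot, sizeof(void*));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    allocKernel<<<1, 1, 0, stream>>>(devPtrSlot, 1 << 20);  // "streamed" malloc
    // ... launch kernels in the same stream that dereference *devPtrSlot ...
    freeKernel<<<1, 1, 0, stream>>>(devPtrSlot);            // "streamed" free
    cudaStreamSynchronize(stream);

    cudaFree(devPtrSlot);
    cudaStreamDestroy(stream);
    return 0;
}
```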

Except that pointers returned by device-side malloc are not usable for transferring data to/from the host (e.g. via cudaMemcpy). That may or may not be important; for the initial problem statement in this thread, it probably is.