Inter-block synchronization

Hi all,
Why is inter-block synchronization not supported by CUDA and NVIDIA hardware?
We can use atomics or global memory to implement inter-block synchronization, but these approaches are not efficient. Furthermore, they don’t guarantee memory consistency unless we use __threadfence or volatile.
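
To be concrete, the kind of SW barrier I mean looks roughly like this (a minimal, single-use sketch, not production code: it spins forever if the grid has more blocks than can be resident on the device at once):

#include <cuda_runtime.h>

// Single-use software grid barrier built from an atomic counter in global
// memory plus __threadfence(). Illustrative only: it deadlocks if gridDim.x
// exceeds the number of blocks that can be resident on the GPU at once.
__device__ volatile unsigned int g_arrived = 0;

__device__ void software_grid_barrier()
{
    __syncthreads();                              // all threads of this block arrive
    if (threadIdx.x == 0) {
        __threadfence();                          // publish this block's global writes
        atomicAdd((unsigned int *)&g_arrived, 1u);
        while (g_arrived < gridDim.x)             // spin until every block checks in
            ;
        __threadfence();                          // pull in the other blocks' writes
    }
    __syncthreads();                              // release the rest of the block
}

__global__ void two_phase(const float *in, float *tmp, float *out)
{
    unsigned int n = gridDim.x * blockDim.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[i] = in[i] * 2.0f;                        // phase 1
    software_grid_barrier();                      // every block has finished phase 1
    out[i] = tmp[i] + tmp[(i + blockDim.x) % n];  // phase 2 reads another block's result
}

int main()
{
    const int threads = 128, blocks = 4;          // tiny grid, assumed co-resident
    const int n = threads * blocks;
    float *in, *tmp, *out;
    cudaMalloc((void **)&in,  n * sizeof(float));
    cudaMalloc((void **)&tmp, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    two_phase<<<blocks, threads>>>(in, tmp, out);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(tmp); cudaFree(out);
    return 0;
}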

Thanks,
Xuhao

I suspect “don’t hide power” is one of the design hints that has kept features like inter-block synchronization (and many many others) from being implemented.

Inter-block synchronization was probably determined to be expensive and a stylistic mismatch with an architecture that might have dozens (hundreds?) of multiprocessors.

So it’s probably missing for a reason… and the implication might be to find another way of accomplishing your goal, or to implement it yourself and accept the lower performance.

That being said, Kepler devices have a low multiprocessor (SMX) count compared to GT200 devices so perhaps performance wouldn’t be that bad. There is a well-known implementation here.

Thanks for your reply. I agree that the HW overhead of supporting inter-block synchronization might be expensive. The fact is that, for some HPC applications, people try to use SW inter-block synchronization to reuse on-chip data instead of terminating the kernel. In that case, even if it is not that slow on Kepler, the SW inter-block synchronization could become the bottleneck.

But the hardware does support grid-wide synchronization… it’s the kernel boundary! This isn’t a facetious answer, but the real way to sync between blocks. There’s not too much overhead from a new kernel launch, especially when it’s queued up in the same stream.
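
In code that is just this (names made up), with no explicit synchronization call at all:

__global__ void phase1(float *buf)      // work before the sync point
{
    buf[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

__global__ void phase2(float *buf)      // work after the sync point
{
    buf[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

void run(float *d_buf, int blocks, int threads, cudaStream_t s)
{
    phase1<<<blocks, threads, 0, s>>>(d_buf);
    // Launches in the same stream execute in order, so the kernel
    // boundary acts as the grid-wide barrier between phase1 and phase2.
    phase2<<<blocks, threads, 0, s>>>(d_buf);
}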

But probably what you want is “please don’t exit my kernel, just run all blocks up to this point and continue”: kind of a __syncblocks() in addition to __syncthreads(). But that pretty much requires the context (registers and shared memory) to be dumped to slow device memory in order to free the resources needed to run more blocks up to the sync point, and then, after all blocks have run, to restore all that state per block. Is that much different from queuing up a new kernel and just restoring state yourself?
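
And “restoring state yourself” would look roughly like this (illustrative names, assuming blockDim.x == TILE): each block spills the shared-memory tile it was working on into a scratch buffer at the end of the first kernel and reloads it at the start of the second.

#define TILE 256

__global__ void before_sync(const float *data, float *spill)
{
    __shared__ float tile[TILE];
    int g = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = data[g] * 2.0f;   // on-chip work before the sync point
    __syncthreads();
    spill[g] = tile[threadIdx.x];         // dump the block's state to DRAM
}

__global__ void after_sync(float *data, const float *spill)
{
    __shared__ float tile[TILE];
    int g = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = spill[g];         // restore the block's state from DRAM
    __syncthreads();
    data[g] = tile[threadIdx.x] + 1.0f;   // continue past the sync point
}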

Don’t forget the nasty problem of synchronizing the running blocks of a grid that doesn’t fit entirely onto your device’s multiprocessors. :)

And that’s probably one of the strongest reasons why the primitive does not exist and why block sync should remain implicit (as @SPWorley points out).

Dynamic Parallelism might be the right way of accomplishing block sync – it keeps synchronization implicit and, when required, can push/pop SMX state.

[ I’ve been resisting digging into DP because it seems like it will be a one-way trip for me and I’ll permanently punt on supporting Fermi devices! ]

As I said, the difference between terminating the kernel and inter-block synchronization is that you lose your on-chip data if you terminate the kernel. For some memory-bound applications, it is critical to make better use of on-chip memory.

So you think DP is implemented by pushing the SMX state into DRAM, launching another kernel, and then resuming the former state from DRAM afterwards?

This is basically the only way I could imagine dynamic parallelism being implemented, since a block can launch an arbitrary number of new blocks and synchronize on their completion.
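
For what it’s worth, a rough sketch of that “launch and wait from inside a kernel” pattern with Kepler-era CDP looks like this (names made up; compiled with -arch=sm_35 -rdc=true -lcudadevrt):

__global__ void child(float *buf)
{
    buf[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

__global__ void parent(float *buf)
{
    // ... work before the sync point ...
    if (threadIdx.x == 0) {
        child<<<16, 256>>>(buf);      // this block launches a child grid
        cudaDeviceSynchronize();      // and waits for all of its blocks to finish
    }
    __syncthreads();
    // ... work after the sync point; the runtime may swap this block's state
    // out to device memory while the child grid runs, which is the push/pop
    // being speculated about here ...
}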

I am not familiar with DP. So if there are too many blocks, DP will consume a lot of DRAM space and also become slower. In that situation, it would be better to terminate the kernel and launch another one from the CPU side, right?

That’s true, but for most CUDA jobs, the size of thread state is not nearly as large as the memory used for input or output data from the kernel. I don’t think many dynamic parallelism jobs are at risk of running out of DRAM.

I was actually not talking about the “out of DRAM” issue. I mean that it might spend some time on context switches if there are too many blocks. So I am wondering what the performance is like when an application uses DP.