Hi all,
Why is inter-block synchronization not supported by CUDA and NVIDIA hardware?
We can use atomics or global memory to implement inter-block synchronization, but they are not efficient. Furthermore, they don’t guarantee memory consistency unless we use __threadfence() or volatile.
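For reference, here is a minimal sketch of the kind of software barrier I mean, built on a global atomic counter (similar in spirit to the well-known lock-free approaches). The function name and parameters are mine; the counter must be zeroed before the kernel runs, it is single-use, and the scheme deadlocks unless every block of the grid is resident on the GPU at once:

```cuda
// Sketch of a software inter-block barrier. count must be zeroed before
// launch. Only safe if all numBlocks blocks are simultaneously resident,
// otherwise waiting blocks starve the blocks that have not launched yet.
__device__ void global_barrier(int *count, int numBlocks)
{
    __syncthreads();                        // every thread in the block arrives
    __threadfence();                        // publish this block's global writes
    if (threadIdx.x == 0) {
        atomicAdd(count, 1);                // announce this block's arrival
        while (atomicAdd(count, 0) < numBlocks)
            ;                               // spin until all blocks have arrived
    }
    __syncthreads();                        // release the rest of the block
}
```

The spinning is exactly why this is inefficient: one thread per block hammers a single global counter while the rest of the block idles.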
I suspect “don’t hide power” is one of the design principles that has kept features like inter-block synchronization (and many others) from being implemented.
Inter-block synchronization was probably determined to be expensive and a stylistic mismatch with an architecture that might have dozens (hundreds?) of multiprocessors.
So it’s probably missing for a reason… and the implication is that you should either find another way of accomplishing your goal or implement it yourself and accept the lower performance.
That being said, Kepler devices have a low multiprocessor (SMX) count compared to GT200 devices so perhaps performance wouldn’t be that bad. There is a well-known implementation here.
Thanks for your reply. I agree that the hardware overhead of supporting inter-block synchronization might be expensive. The fact is that for some HPC applications, people try to use software inter-block synchronization to reuse on-chip data instead of terminating the kernel. In that case, even if it is not that slow on Kepler, software inter-block synchronization could still become the bottleneck.
But the hardware does support block-wide synchronization… it’s a kernel! This isn’t a facetious answer, but the real way to sync between blocks. There’s not too much overhead from a new kernel launch, especially when it’s queued up in the same stream.
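Concretely, the pattern looks like this (phase1 and phase2 are hypothetical kernel names standing in for the halves of your computation). Launches queued into the same stream execute in order, so every block of the first kernel finishes, and its global-memory writes are visible, before any block of the second starts:

```cuda
// The kernel boundary acts as a device-wide barrier: no explicit
// synchronization primitive is needed between the two phases.
phase1<<<grid, block, 0, stream>>>(d_data);
phase2<<<grid, block, 0, stream>>>(d_data);  // sees all of phase1's writes
cudaStreamSynchronize(stream);               // host waits for both to finish
```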
But probably what you want is “please don’t exit my kernel, just run all blocks up to this point and continue.” Kind of a __syncblocks() in addition to __syncthreads(). But that pretty much requires that context (registers and shared memory) be dumped to slow device memory in order to free the resources needed to run more blocks up to the sync point, then, after all blocks have run, to restore all that state per block. Is that much different than queuing up a new kernel and just restoring state yourself?
As I said, the difference between terminating the kernel and inter-block synchronization is that you lose on-chip data if you terminate the kernel. For some memory-bound applications, it is critical to make better use of on-chip memory.
This is basically the only way that I could imagine doing dynamic parallelism, since a block can launch an arbitrary number of new blocks and synchronize on the completion of them.
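A sketch of that dynamic-parallelism pattern, assuming a compute capability 3.5+ device compiled with -rdc=true (child and parent are hypothetical kernel names; device-side cudaDeviceSynchronize() is how CDP exposed waiting on children in this generation of CUDA):

```cuda
__global__ void child(float *data);  // does one piece of the work

__global__ void parent(float *data, int n)
{
    // One thread launches an arbitrary number of child blocks and then
    // synchronizes on their completion, all without returning to the host.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        child<<<n, 256>>>(data);
        cudaDeviceSynchronize();     // wait on the device for the children
    }
}
```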
I am not familiar with DP. But if there are too many blocks, DP will consume a lot of DRAM space and also become slower. In that situation, it would be better to terminate the kernel and launch another one from the CPU side, right?
That’s true, but for most CUDA jobs, the size of thread state is not nearly as large as the memory used for input or output data from the kernel. I don’t think many dynamic parallelism jobs are at risk of running out of DRAM.
I was actually not talking about the “out of DRAM” issue. I meant that it might spend some time on context switches if there are too many blocks. So I wonder what the performance is like when an application uses DP.