This is conflating correct behavior of the lock mechanism itself (i.e., granting the lock to only one thread at a time, and releasing it) with correct usage of the lock for some other purpose (protecting a critical section, and enforcing ordering of data read/write activity between disparate threads). As I stated, a threadfence may indeed be needed for those other purposes.
The problem with this line of thinking is that it replaces programmer understanding of what is going on, with some idea that we can treat the lock as an appliance or black box. Rote ideas like “threadfence is required” or “is not required” are not enough for understanding.
Let me give you an example. The purpose of threadfence is to enforce ordering of memory access activity on an otherwise weakly ordered memory model. If you dispute this summary, please read the documentation.
I can demonstrate that the basic lock behavior (only one thread gets the lock at a time) is not affected by threadfence. Indeed it cannot be, for the reasons already stated: atomics are sufficient, and if they were not, threadfence would not fix it.
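To make the point concrete, here is a minimal spin-lock sketch. The names (`acquire`, `release`, the `lock` parameter) are illustrative, not from any particular library. The mutual-exclusion property rests entirely on the atomics; no fence appears, and none is needed for that property alone:

```cuda
// Minimal device-side spin lock: 0 = free, 1 = held.
// Mutual exclusion is guaranteed by the atomics alone.

__device__ void acquire(int *lock) {
    // atomicCAS returns the old value; spin until we flip 0 -> 1.
    // Only one thread can observe the 0 -> 1 transition.
    while (atomicCAS(lock, 0, 1) != 0) { }
}

__device__ void release(int *lock) {
    // Handing the lock back also needs only an atomic.
    // Whether a __threadfence() belongs before this line depends on
    // what the critical section did, not on the lock mechanism itself.
    atomicExch(lock, 0);
}
```

(As an aside, naive spin loops like this have their own hazards under pre-Volta warp scheduling when multiple threads of the same warp contend for the lock, but that is a separate issue from the fence question.)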
Introducing the notion that "threadfence is required on the unlock, otherwise bad things happen" replaces programmer understanding with rote knowledge, which is a bad idea IMO. The premise has merit in the situation where:
- the locking thread is updating global memory
- the locking thread is not coordinating any other thread activity in the critical section
- the purpose of the lock is to allow critical-section updates to data that will be more-or-less immediately consumed by other threads
In that case, I agree that the threadfence prior to unlock is a good idea (perhaps mandatory for correctness), and while I haven’t studied the cuda by example codes recently, it might very well be the case that this is a bug of omission in the code examples.
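That pattern might be sketched as follows. The symbol names (`d_lock`, `d_result`) are hypothetical, chosen only for illustration:

```cuda
__device__ int d_lock = 0;     // 0 = free, 1 = held
__device__ int d_result;       // data produced under the lock,
                               // consumed by other threads

__device__ void update_under_lock(int value) {
    while (atomicCAS(&d_lock, 0, 1) != 0) { }  // acquire

    d_result = value;          // critical-section update to global memory

    __threadfence();           // order the data write before the lock
                               // release: any thread that later observes
                               // the lock as free must also observe the
                               // updated d_result

    atomicExch(&d_lock, 0);    // release
}
```

Without the fence, on a weakly ordered memory model, another thread could conceivably acquire the lock and still read a stale `d_result`.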
However, that is one particular use case for a lock. There exist other use cases where the critical section update is done in such a way that there is no hazard (therefore no benefit from threadfence), and there exist yet other use cases where simply doing a threadfence in the unlock routine is not enough by itself to ensure correctness of the critical section updates, if multiple threads are involved. There is no reason that a single master thread, after acquiring a lock, could not coordinate the activity of multiple other worker threads, within the critical section.
In that case, a single threadfence in the unlock routine, executed by the unlocking thread alone, would not be sufficient for correctness.
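A sketch of why, under the assumption that one thread acquires the lock on behalf of its whole block (the kernel and variable names here are hypothetical): a threadfence orders only the memory activity of the thread that executes it, so every writing thread needs its own fence, coordinated by synchronization, before the lock owner can safely release.

```cuda
__global__ void blockwise_update(int *data, int *lock) {
    // Thread 0 acquires the lock on behalf of the entire block.
    if (threadIdx.x == 0)
        while (atomicCAS(lock, 0, 1) != 0) { }
    __syncthreads();            // workers wait until the lock is held

    data[threadIdx.x] += 1;     // every thread updates global memory

    __threadfence();            // each WRITING thread fences its own
                                // writes; a fence executed only by
                                // thread 0 would not order the others'
    __syncthreads();            // ensure all fences have executed

    if (threadIdx.x == 0)
        atomicExch(lock, 0);    // release only after all writes are ordered
}
```

Here a lone `__threadfence()` inside the unlock routine would cover only thread 0's writes; the other threads' updates would remain unordered with respect to the lock release.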
So, due to the complexity of critical sections in CUDA, my general approach is, I believe, a conservative one: expect the programmer to have the necessary knowledge to understand, construct, and use a lock correctly, rather than enforce rules such as "the unlock must have a threadfence".
I’ll go back to my previous statement, which I stand by for general usage and understanding: