(Errata for CUDA by Example) Is atomicCAS() safe for simulating a lock, even with __threadfence()?

I see that the CUDA by Example errata page has updated both the lock and unlock implementations (pp. 251-254) with an additional __threadfence(), because “It is documented in the CUDA programming guide that GPUs implement weak memory orderings which means other threads may observe stale values if memory fence instructions are not used.”

According to my understanding, this new implementation has yet another bug.

Let me explain a possible scenario where this synchronization mechanism might fail and violate mutual exclusion.

Suppose thread 0 from block 0 does atomicCAS on the mutex and sets it to 1. Before it finishes executing the __threadfence() that is supposed to make sure all threads from other blocks see the updated value of the mutex, what if thread 0 from block 1 (also trying to grab the lock) reads a stale mutex value of 0 and then sets it to 1? Then both threads will enter the critical section and violate mutual exclusion.

This could be avoided by executing the atomicCAS and the __threadfence() atomically as one unit, but that is not what the code does, and I am not sure it is even possible.

Please let me know if my understanding is correct. Also let me know if I should post the lock and unlock implementations here.

Thanks in advance.

CUDA by Example book PDF link: http://www.hds.bme.hu/~fhegedus/C++/CUDA.by.Example.pdf (pp. 251-254)
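
Actually, since it’s short, here is my reading of the implementation in question, reconstructed from the book’s lock.h with the errata’s added fences (the printed code may differ slightly in details like error checking):

```cpp
struct Lock {
    int *mutex;

    Lock(void) {
        cudaMalloc((void**)&mutex, sizeof(int));
        cudaMemset(mutex, 0, sizeof(int));
    }

    __device__ void lock(void) {
        // spin until we atomically swap the mutex from 0 (free) to 1 (held)
        while (atomicCAS(mutex, 0, 1) != 0);
        __threadfence();        // added by the errata
    }

    __device__ void unlock(void) {
        __threadfence();        // added by the errata
        atomicExch(mutex, 0);
    }
};
```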

atomicCAS, when used by multiple threads, cannot pick up a stale value. Global atomics bypass the L1 and are resolved in the L2 cache (for Kepler and beyond, anyway), which is a device-wide resource.

So if thread 0 does an atomicCAS and acquires the mutex, then any other thread doing an atomicCAS will not acquire the mutex. There is no need to also do a threadfence to make this work.

If another thread does some sort of ordinary read of the mutex value, it may pick up a stale value. Nevertheless, an ordinary read is not part of the mechanism used to acquire the mutex. The additional threadfence will make it less likely that an ordinary or unprotected read by any thread would pick up a stale value.
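
To illustrate, a hypothetical snippet (not from the book):

```cpp
// A plain, non-atomic peek at the mutex word. This read may return a
// stale 0 even while another thread holds the lock, but it plays no part
// in acquisition -- only the atomicCAS decides ownership.
__device__ bool looks_free(int *mutex) {
    return *mutex == 0;   // possibly stale; advisory only
}
```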

Of course any ordinary write to the mutex by a thread other than the mutex controlling thread would completely destroy the mutex semantics.

Thank you for explaining.

Hello txbob,

I understand that we don’t need a threadfence inside lock, as you’ve explained already.

However, shouldn’t we still enforce a threadfence inside unlock, before the atomicExch(), so that the (non-atomic) updates/writes to global variables are made visible to the rest of the threads? Critical sections are meant to be mutually exclusive, and the updates done inside them should be globally visible.

Thank you.

threadfence is not needed to make a global atomic lock mechanism itself function properly. It may be needed for other purposes.

Thank you

Threadfence in unlock is definitely needed for a correct global atomic mutex. The examples in CUDA by Example have been observed to produce incorrect results without it (tested on Kepler and Maxwell GPUs).

Threadfence in lock is more controversial. AFAIK, no failures have been observed when it is omitted, and the CUDA documentation isn’t clear enough to determine whether it is needed. However, other modern languages (e.g. OpenCL, C++) require some kind of “fence” in lock. It’s safest to include it.

There’s no need for the CAS and threadfence to execute atomically though (as txbob pointed out).

This is conflating correct behavior of the lock mechanism itself (i.e. granting a lock to only one thread, and releasing the lock) with correct usage of the lock for some other purpose (protecting a critical section, and enforcing ordering of data read/write activity between disparate threads). As I stated, a threadfence may indeed be needed for other purposes.

The problem with this line of thinking is that it replaces programmer understanding of what is going on, with some idea that we can treat the lock as an appliance or black box. Rote ideas like “threadfence is required” or “is not required” are not enough for understanding.

Let me give you an example. The purpose of threadfence is to enforce ordering of memory access activity on an otherwise weakly ordered memory model. If you dispute this summary, please read the documentation.

I can demonstrate that the basic lock behavior (only one thread gets the lock at a time) is not affected by threadfence (and indeed, it cannot be, for the reasons already stated: atomics are sufficient, and if they were not, threadfence would not fix it).

The notion that “threadfence is required on the unlock, otherwise bad things happen” replaces programmer understanding with rote knowledge, which is a bad idea IMO. The premise has merit in the situation where:

  • the locking thread is updating global memory
  • the locking thread is not coordinating any other thread activity in the critical section
  • the purpose of the lock is to allow critical-section updates to data that will be more-or-less immediately consumed by other threads

In that case, I agree that the threadfence prior to unlock is a good idea (perhaps mandatory for correctness), and while I haven’t studied the CUDA by Example codes recently, it might very well be the case that this is a bug of omission in the code examples.
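
As a hypothetical illustration of that situation (assuming the Lock posted above, with names invented here):

```cpp
__device__ int shared_value;   // data protected by the lock

__device__ void publish(Lock &lock, int v) {
    lock.lock();
    shared_value = v;   // ordinary global write inside the critical section
    // without a __threadfence() before the releasing atomicExch, the next
    // thread to acquire the lock may still observe a stale shared_value
    lock.unlock();
}
```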

However, that is one particular use case for a lock. There exist other use cases where the critical-section update is done in such a way that there is no hazard (and therefore no benefit from a threadfence), and there exist yet other use cases where simply doing a threadfence in the unlock routine is not enough by itself to ensure correctness of the critical-section updates, if multiple threads are involved. There is no reason that a single master thread, after acquiring a lock, could not coordinate the activity of multiple other worker threads within the critical section.

In that case, the single threadfence issued by the master thread in the unlock routine would not be sufficient for correctness.
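
A hedged sketch of such a case (details invented for illustration):

```cpp
#define WORKERS 4
__device__ int buffer[WORKERS];   // filled by workers inside the critical section
__device__ int done_count = 0;

// Each worker writes its slice, fences, then signals completion. The
// master's fence in unlock() cannot order these writes on its behalf.
__device__ void worker_body(int id, int value) {
    buffer[id] = value;           // worker's critical-section write
    __threadfence();              // make it visible device-wide...
    atomicAdd(&done_count, 1);    // ...before signaling done
}

__device__ void master_body(Lock &lock) {
    lock.lock();
    // ... signal the workers to run worker_body() ...
    while (atomicAdd(&done_count, 0) < WORKERS);   // wait for all workers
    lock.unlock();   // this fence only orders the master's own writes
}
```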

So, due to the complexity of critical sections in CUDA, my general approach is, I believe, a conservative one: expect that the programmer will have the necessary knowledge to understand, construct, and use a lock correctly, rather than enforce rules such as “the unlock must have a threadfence”.

I’ll go back to my previous statement, which I stand by for general usage and understanding: a threadfence is not needed to make a global atomic lock mechanism itself function properly; it may be needed for other purposes.

Thanks for the discussion txbob. Here’s a few more thoughts I have.

Isn’t that the purpose of libraries though? That is, programmers don’t have to understand every line of the implementation and instead just understand the specification. I wouldn’t expect a programmer to understand the C standard library to use it. Can’t a lock be the same?

I agree 100%, but there’s no reason fences can’t be encapsulated in a library to hide the complexities of modern weak memory models. A proper lock implementation will provide the SC-for-DRF property (sequential consistency for data-race-free programs) by using fences in lock and unlock.

Do you mean you can demonstrate this with a program, or just with an architectural description? I can’t think of a program that can demonstrate this without using a threadfence, because to empirically determine mutual exclusion (only one thread gets the lock at a time) from the program’s output, I believe you need some sort of global memory store to a location disjoint from the lock. Once you do that, you have weak memory and need a fence.
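
To make this concrete, here is the kind of test I have in mind (hypothetical code, assuming the Lock from above; one thread per block, since intra-warp spinning on atomicCAS can deadlock):

```cpp
__global__ void exercise(Lock lock, int *count) {
    lock.lock();
    int c = *count;    // ordinary read of protected data,
    *count = c + 1;    // at a location disjoint from the lock word
    lock.unlock();
}

// Launched as exercise<<<N, 1>>>(lock, d_count). If mutual exclusion held,
// *count should end with the value N -- but verifying that relies on these
// ordinary loads/stores, which is exactly where weak ordering enters.
```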

See above. I would be really interested in hearing about such a case.

This one is clearer to me. For example, if not all shared memory* accesses are protected by the mutex, I could see this. However, that would then be a case of a data race, and one could argue that the mutex isn’t being used correctly. I haven’t seen any program actually use this idiom, and I would be interested in an example.

*shared memory meaning memory locations that are accessed by different threads, not CUDA shared memory.

Critical sections in CUDA do not have to be complicated, though. The fences as I describe them can be used to implement mutexes with the same specifications as the mutexes of C++, Java, OpenMP, etc. Programmers with intuition from those languages could then apply those intuitions in CUDA as well. No other language that provides locks also requires programmers to add fences in the way you seem to be suggesting.

Programming concurrency is hard :) Locks are nice abstractions which can ease the burden and hide the insanely complicated weak memory model by encapsulating fences in their implementation.

Thanks again for the discussion!

Who said anything about libraries? The original code is not in a library. If this were a well written library, with reasonably good testing and a well-defined set of conditions under which it is valid, I would have no argument. But the original code is not a library. In fact I would say it looks nothing like a library. Furthermore it’s presented in a book about learning CUDA. Not about how to abstract away the difficulties, after you’ve understood the concepts. Therefore I wouldn’t treat it the same way I would treat a library reference. If someone says, “this ought to have a threadfence”, and they understand all the concepts, then go for it.

I’m pretty sure I can demonstrate something that proves only one thread has the lock at a time, without any other memory considerations. I can certainly do it with some other memory considerations that are not threadfence (e.g. atomics or volatile). I’m sure if I wrote something you would argue that it’s not what you intended, so I see little point in getting into that. It’s OK if you disagree with me; it’s the nature of community. I never once said memory ordering didn’t have to be attended to. I suggested it’s a different concept from ensuring a unique thread wins a lock. If you disagree, and feel they ought to be treated together, that’s OK.

CUDA doesn’t provide locks. What was presented was straight-line CUDA code, not a library call, and it appeared in a particular book, for a particular purpose, and the author apparently didn’t even get it right for that purpose.

If you pick up that code from that book, and feel that it ought to have a threadfence, feel free to add one.

Stating it ought to have one is a matter of opinion. I have mine, you have yours. It’s OK.

There isn’t much talk about threadfence on these boards, so I think discussions like this are really valuable. Thanks for taking the time txbob.

You might be right, but I think we could have a really nice discussion around this :)

I think this summarizes nicely the differences of our opinions.

I think, though, that there are two things that are not opinion and that we could hopefully agree on (please let me know if not!). Let me briefly try to summarize them:

  1. The concrete lock applications in CUDA by Example are broken, leading to observably incorrect results on current GPUs. The fix is to use threadfence in a way that is semantically equivalent to the description in the errata.

  2. If a programmer wants to use the lock.h file from CUDA by Example (which implements a standard mutex API) to write data-race-free code, then encapsulating the threadfence in the mutex implementation (as I described earlier) is sufficient to achieve the memory-ordering properties found in other languages. That is, a data-race-free application will not have to (and in fact, cannot) use additional fences to ensure correctness.

After some additional exposure to the ideas expressed here, I’ve come to the conclusion that my previous position(s) were not correct. It’s not a good idea to try to separate lock behavior from what the lock is intended to accomplish. A well-written lock function should provide some level of fencing.

I think that is a key point being made by other contributors to this thread, and I would like to amend my previous comments and concur with that. Also, with respect to pointing out defects in the code presented in CUDA by Example, I should not have obscured that or diminished those valid points. Rather than trying to edit my previous comments and create incoherence in this thread, I’d like this comment to reflect that my positions above were incorrect, for anyone who may visit this thread in the future.

For such visitors, you may also be interested in a new addition to the PTX manual in CUDA 9:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model

This addition provides a more formal statement of the CUDA memory model than has previously existed, albeit at the PTX level. Of particular interest to those investigating or possibly writing CUDA locks might be section 8.7:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-acquire-patterns
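
For example, newer CUDA toolkits ship libcu++, which exposes acquire/release semantics directly. A minimal sketch of a device-wide mutex in that style (assuming a toolkit recent enough to provide <cuda/atomic>):

```cpp
#include <cuda/atomic>

using mutex_t = cuda::atomic<int, cuda::thread_scope_device>;

__device__ void lock(mutex_t *m) {
    int expected = 0;
    // acquire on success orders the critical section after the acquisition
    while (!m->compare_exchange_weak(expected, 1,
                                     cuda::std::memory_order_acquire))
        expected = 0;   // CAS failure overwrites 'expected'; reset it
}

__device__ void unlock(mutex_t *m) {
    // release makes critical-section writes visible to the next acquirer
    m->store(0, cuda::std::memory_order_release);
}
```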
