Intra-warp deadlock

Hi,
here is the example code:

__device__ void waitFlag(int *flag)
{
    // Try to flip the flag from 0 to 1; exactly one thread succeeds.
    int val = atomicCAS(flag, 0, 1);
    if (val == 1) {
        // Lost the CAS: spin until the flag is no longer 1.
        do {
            val = atomicAdd(flag, 0);
        } while (val == 1);
    }
    else {
        // Won the CAS: signal the waiters.
        atomicExch(flag, 2);
    }
}
Imagine this case: thread A and thread B are in the same warp.
At the very beginning, the flag is 0.
Then both thread A and thread B enter the function waitFlag.
At the atomicCAS, assume thread A succeeds in setting *flag, so afterwards val is 0 in thread A and 1 in thread B.
Since thread A and thread B are in the same warp, assume thread B's branch is executed first, so thread B is active and thread A is inactive (divergent).
Thread B then spins in the do...while loop until the flag is set to 2,
but only thread A can set the flag to 2, and thread A is inactive.
So a deadlock happens.
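For concreteness, here is a minimal launch sketch of the scenario (not part of my original code; the kernel name and launch shape are just made up for illustration):

__global__ void testKernel(int *flag)
{
    // Thread 0 plays the role of thread A and thread 1 the role of thread B;
    // both belong to the same warp of this block.
    if (threadIdx.x < 2) {
        waitFlag(flag);
    }
}

// Host side (sketch):
//   int *flag;
//   cudaMalloc(&flag, sizeof(int));
//   cudaMemset(flag, 0, sizeof(int));   // flag starts at 0
//   testKernel<<<1, 32>>>(flag);
//   cudaDeviceSynchronize();            // may never return if the warp deadlocks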
Am I right?
Looking forward to any reply.
Thanks.

There is no guarantee of that ordering in the CUDA programming model that I am aware of. There is certainly a warp divergence point at the if-condition:

if (val == 1) {

but whether the compiler schedules the then-body or the else-body first is undefined.

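If the intent is just this kind of one-shot signal, one possible restructuring (a sketch on my part, not something the programming model promises either, so it should still be checked against the generated SASS) is to have the CAS winner publish the value before any divergent branch, and have every thread wait only after the branch has reconverged:

__device__ void waitFlag(int *flag)
{
    int val = atomicCAS(flag, 0, 1);
    if (val == 0) {
        // The winner signals before any sibling starts spinning.
        atomicExch(flag, 2);
    }
    // All threads, winner included, wait together after reconvergence.
    while (atomicAdd(flag, 0) != 2) { }
}
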
Aside:

The code looks unpredictable to me. I’m not sure why you would want to go down this avenue if you’re interested in reliable code. Attempting to negotiate locks between threads in a warp is generally tricky. The Volta execution model improves this, but does not resolve the condition I stated at the beginning of my post. When trying to assess the behavior of warp-locking code like this, it is sometimes necessary to look at the generated SASS:

https://stackoverflow.com/questions/31194291/cuda-mutex-why-deadlock/31195230#31195230
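
For completeness, on Volta and newer parts (sm_70+, independent thread scheduling) a lock contended by threads of the same warp is often written roughly like the sketch below. The names acquireLock/releaseLock are mine, and even this should be validated on your target architecture rather than taken as a guaranteed-correct drop-in:

__device__ void acquireLock(int *lock)
{
    // Spin until this thread flips the lock from 0 to 1. Forward progress
    // among contending threads of one warp relies on independent thread
    // scheduling (compute capability 7.0 and later).
    while (atomicCAS(lock, 0, 1) != 0) { }
    __threadfence();   // make the previous owner's writes visible
}

__device__ void releaseLock(int *lock)
{
    __threadfence();   // make this thread's writes visible before releasing
    atomicExch(lock, 0);
}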