Do I need threadfence?

I know that __threadfence() blocks until the calling thread's memory writes are visible to all other threads, etc.

But consider the following scenario:

  1. If each thread only writes to and reads from its own portion of shared or global memory, do I need to call threadfence before reading?

  2. If I read a global or shared variable after a (possibly different) thread has written to it using an atomic operation, do I need threadfence()?

  3. I can’t use atomics with volatile variables. I assume this is because atomics flush the cache anyway. Is this true?

I’m under the impression that threadfence is not required in the above scenarios, but I have no hard data to confirm it.

Thank you.

No, because there is no contention here.

No, because an atomic operation (a read-modify-write) is stronger than a threadfence.

threadfence guarantees the ordering of the memory operations of the thread that issues the threadfence; it does not affect the behaviour of OTHER threads.
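For example, here is a minimal sketch of the producer side (the names `result` and `ready_flag` are hypothetical):

```
__device__ int result;
__device__ int ready_flag;   // assumed to be zeroed by the host before launch

__global__ void producer()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = 42;         // write the data first
        __threadfence();     // orders the data write before the flag write
        ready_flag = 1;      // no thread can observe ready_flag == 1
                             // without also observing result == 42
    }
    // The fence does not stall any other thread and does not, by itself,
    // notify anyone that the flag has been set; a reader would typically
    // poll ready_flag through a volatile or atomic access.
}
```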

Great, thanks.

Quick question here - when you say threadfence does not affect the behavior of OTHER threads, does that mean a threadfence only makes the CALLING thread itself wait until its write is visible to everyone else, and does not actually make other threads wait?

If that is the case, then is there a way to implement a race-free scheme in which a group of threads (grpB) needs to read something that another group of threads (grpA) writes? Also, before starting its read sequence, each thread of grpB spins on a variable (a flag) that ONE of the grpA threads sets - hence grpB should wait until grpA's writes are visible.

Does CUDA give us a programming construct with such an ability?
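For concreteness, a minimal sketch of such a scheme, assuming grpA is warp 0 and grpB is warp 1 of the same block, CUDA 9 or later, and hypothetical names (`data`, `flag`):

```
__global__ void producer_consumer_block()
{
    __shared__ int data[32];
    __shared__ int flag;                     // 0 = not ready, 1 = ready

    const int tid = threadIdx.x;
    if (tid == 0) flag = 0;
    __syncthreads();                         // flag is initialized before anyone spins

    if (tid < 32) {
        // grpA: warp 0 produces the data.
        data[tid] = tid * tid;               // placeholder for the real computation
        __syncwarp();                        // every thread of warp 0 has finished its write
        if (tid == 0) {
            __threadfence_block();           // order the data writes before the flag write
            atomicExch(&flag, 1);            // publish "data is ready"
        }
    } else if (tid < 64) {
        // grpB: warp 1 waits for the flag, then reads.
        while (atomicAdd(&flag, 0) == 0) { } // spin until grpA publishes the flag
        __threadfence_block();               // order the flag read before the data reads
        int value = data[tid - 32];          // grpA's writes are visible here
        // ... use value ...
    }
}
```

If every thread of the block can reach the same point, __syncthreads() is the simpler way to make grpA's writes visible to grpB within a block; the flag/fence pattern above is for cases where that is not possible.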

It depends on whether your threads are in the same warp (or half-warp - it's architecture-dependent)! To be safe you need all your grpA threads to be in the same half-warp, and the same for the grpB threads, but the threads of grpA and grpB should not be in the same warp - they should, however, be in the same block! Anyway, it's architecture-dependent, and I would not encourage you to think this way!

I would do a quick, simple reduce to count the threads that have finished writing, to make sure they are all done (especially if they are in different blocks).
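Something along the lines of this sketch, in the spirit of NVIDIA's threadFenceReduction sample, here counting writer blocks rather than individual threads (the names `partial` and `done_count` are made up):

```
__device__ unsigned int done_count = 0;   // must be reset to 0 before each launch

__global__ void write_partials_and_count(float *partial)
{
    // ... block-wide computation of a partial result would go here ...
    float block_result = (float)blockIdx.x;      // placeholder value

    __shared__ bool is_last;
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = block_result;      // publish this block's result
        __threadfence();                         // make the write visible before it is counted
        unsigned int finished = atomicAdd(&done_count, 1u);
        is_last = (finished == gridDim.x - 1);   // true only for the last block to arrive
    }
    __syncthreads();                             // broadcast is_last to the whole block

    if (is_last) {
        // Exactly one block reaches this point, and by construction every other
        // block's write to partial[] is visible here, so this block can safely
        // read partial[0 .. gridDim.x-1] and combine them.
    }
}
```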