CUDA 8 RC code generation related to warps updating __shared__ memory

Before I file a bug, I wanted to make sure that there is not a mistake on my part first.

I have noticed that code generation for compute capability 5.2 (Maxwell) differs between CUDA 8.0 RC and CUDA 7.5, in terms of registers used and probably more ‘under the covers’. In my case, if I compile my kernel for either sm_52 or sm_61 using CUDA 8.0 RC, 32 registers are used, while if I compile the same kernel for sm_52 using CUDA 7.5, 40 registers are used.
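
(For reference, the register counts come from the ptxas verbose output, via something along these lines, with mykernel.cu as a placeholder:)

	nvcc -arch=sm_52 -Xptxas -v -c mykernel.cu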

The bug I am seeing appears only with CUDA 8, and not at all with CUDA 7.5, for the exact same code with the exact same inputs.

This has to do with the properties of warps and the threads in a warp.

I cannot post the actual code because it is for a client, but I think I can describe the issue:

There are a number of warps launched (let’s say 8, so 256 threads per block), and each warp has its own shared memory buffer which only that warp accesses.

First, each thread in a warp loads one 8-byte value from global memory into shared memory; then each thread in that warp enters a loop where it reads all of the values from that shared array and uses them to write out to global memory via atomics. There is no possibility of warp divergence before this loop, and there is no case where fewer than 32 threads participate in the updates. A rough sketch of the pattern is below.
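
A minimal sketch of the pattern (all identifiers and the atomic update are invented for illustration; the real kernel is the client’s):

	#define WARPS_PER_BLOCK 8

	__global__ void sketch(const float2* __restrict__ in, float* out)
	{
	    __shared__ float2 buf[WARPS_PER_BLOCK][32]; // one private buffer per warp

	    const int lane = threadIdx.x & 31;          // lane within the warp (0-31)
	    const int warp = threadIdx.x >> 5;          // local warp id (0-7)
	    const int gid  = blockIdx.x * blockDim.x + threadIdx.x;

	    buf[warp][lane] = in[gid];                  // each lane stores one 8-byte value

	    // <-- this is where I later tried bar.sync / membar.cta (see below)

	    for (int i = 0; i < 32; ++i) {              // each lane reads ALL 32 values
	        float2 v = buf[warp][i];
	        atomicAdd(&out[gid], v.x + v.y);        // placeholder for the real update
	    }
	}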

The problem is that, using CUDA 8.0 RC running on a Pascal Titan X, the answers are wrong by a small amount, and the error varies from run to run.

With CUDA 7.5 I ran the code over 200 times on two different machines with Maxwell GPUs. On those machines there were never any issues and the answers were always correct.

If I force the first thread in the warp (lane 0) to do all the shared memory updates and compile using CUDA 8.0 RC, then this issue goes away, but the code is slower and more registers are used. A sketch of the workaround follows.
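
That workaround looks roughly like this (again a sketch, not the actual code; warp_first is a made-up name for the first global index handled by this warp):

	if (lane == 0) {
	    // lane 0 performs all 32 shared memory stores itself
	    for (int i = 0; i < 32; ++i)
	        buf[warp][i] = in[warp_first + i];
	}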

It does not matter if I qualify the shared memory as ‘volatile’ in this case.

I know that warp-synchronous programming is discouraged, but for this kernel it is necessary.

I even used the following inline PTX after each thread in a warp does its update to shared memory, which did not make a difference:

__device__ __forceinline__ void __bar_sync(const unsigned int id, const unsigned int threads){
  asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(threads));
}

Each thread in a warp of 32 updates a float2 value in __shared__ memory
....

	__bar_sync(local_warp_id, 32);

....then each thread enters a loop where they all read from those __shared__ values (read only),

where local_warp_id is the thread number divided by 32 (so in this case 0-7). As I understand it, this bar.sync should force the warp to wait until all of its threads have arrived at the barrier. Please correct me if I am wrong in this assumption.

I could post the SASS if that would help, but first I want to make sure that:

  1. I did not make incorrect assumptions about warp behavior.
  2. If I did make a mistake, then why does the CUDA 7.5 build seem to always return the correct answers?

  1. Is there any possibility of warp divergence in your code, e.g. in the loop (yes/no)? If so, where?
  2. Did you try using cuda-memcheck with the synccheck and racecheck subtools?
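
For reference, those subtools are selected like this (./app stands in for the binary under test):

	cuda-memcheck --tool racecheck ./app
	cuda-memcheck --tool synccheck ./app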

There is a slightly cryptic message at the end of

https://devtalk.nvidia.com/default/topic/934562/cuda-programming-and-performance/nvidia-pascal-titan-x-geforce-gtx-1080-gtx-1070-amp-gtx-1060/9

which says that shared memory is bugged in the latest driver. Could that be your issue?

Could be… Trying to verify via the cuda-memcheck racecheck tool.

A more general question I have for NVIDIA is how to sync a warp with a barrier so that all shared memory updates are completed and ready to be read by all the threads in that warp.

Using CUDA 8 RC the following does NOT work, though maybe I am using it incorrectly:

// each thread in a warp (lanes 0-31) updates a single adjacent 8-byte value in __shared__ memory

// then inline asm...

	asm volatile("membar.cta;");

// then all threads in that warp enter a loop where they read from the just-updated __shared__ memory

According to the PTX ISA documentation, Section 8.7.21.2, regarding membar:

“Waits for all prior memory accesses requested by this thread to be performed at the CTA, global, or system memory level. level describes the scope of other clients for which membar is an ordering event. Thread execution resumes after a membar when the thread’s prior memory writes are visible to other threads at the specified level, and memory reads by this thread can no longer be affected by other thread writes.”

In the end I am just trying to make a warp wait until all shared values are ready to be read by the threads in that warp (and only that warp).
In CUDA 7.5 it seems to do this correctly every time without any membar.cta or bar.sync, but with CUDA 8.0 RC the only way I can accomplish this is to force lane 0 to do all the shared memory updates.

Use any one of the following approaches; all are documented in the CUDA programming guide (a combined sketch follows the list):

  1. __syncthreads(); // be sure to adhere to usage requirements

  2. mark the shared memory in question as volatile

  3. __threadfence_block();
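
For example, applied to the per-warp pattern described above (a sketch reusing warp, lane, and a float2 value v from earlier; it combines options 2 and 3 for illustration, though any one approach should do; note that float2 cannot be assigned through a volatile lvalue, hence the member-wise stores):

	__shared__ volatile float buf_x[8][32];  // option 2: volatile per-warp buffers
	__shared__ volatile float buf_y[8][32];

	buf_x[warp][lane] = v.x;                 // each lane writes its own slot
	buf_y[warp][lane] = v.y;

	__threadfence_block();                   // option 3: make the writes visible to the block
	// __syncthreads();                      // option 1 also works, provided every
	                                         // thread in the block reaches it

	for (int i = 0; i < 32; ++i) {
	    float x = buf_x[warp][i];            // reads now observe the other lanes' stores
	    float y = buf_y[warp][i];
	    // ... use x and y ...
	}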

I’ll add my $0.02…

Warp lanes are coherent.

The shared memory pattern you describe sounds legit to me and you should not need any barriers or synchronization.

Is it possible that the global memory region is being read/written by more than one warp?

Is it possible there is a bug elsewhere in your code?

As I’ve noted before on this forum, nasty PTXAS bugs have been found in the past – especially at the highest optimization level.

As long as it is marked volatile.

The shared memory “pattern” described involves 2 salient features:

  1. All threads write to shared memory, albeit perhaps each thread is only writing to one location.
  2. All threads read from shared memory, including reading from multiple locations.

One of the optimizations the compiler is allowed to make (unless you mark the shared memory as volatile) is to “optimize” a shared memory location into a register. That register copy can persist for as long as the compiler deems appropriate, unless a memory barrier, or an execution barrier with a built-in memory barrier, intervenes.

volatile should prevent the compiler from making said optimizations.
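
To make the hazard concrete (a contrived sketch; L is a hypothetical lane index):

	// lane L stores to its slot:
	buf[warp][L] = v;        // without volatile, ptxas may keep this value in a
	                         // register and defer the actual shared memory store

	// another lane in the same warp later loads that slot:
	float2 r = buf[warp][L]; // ...and can observe stale buffer contents, because
	                         // lane L's store has not yet reached shared memory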