Before I file a bug report, I want to make sure the mistake is not on my part.
I have noticed that code generation for compute capability 5.2 (Maxwell) differs between CUDA 8.0 RC and CUDA 7.5, in terms of registers used and probably more 'under the covers'. In my case, if I compile my kernel for either 5.2 or 6.1 using CUDA 8.0 RC, 32 registers are used, while if I compile the same kernel for 5.2 using CUDA 7.5, 40 registers are used.
The bug I am seeing appears only with CUDA 8.0 RC, and never with CUDA 7.5, for the exact same code with the exact same inputs.
This has to do with the properties of warps and the threads in a warp.
I cannot post the actual code because it is for a client, but I think I can describe the issue:
A number of warps are launched (let's say 8, so 256 threads per block), and each warp has its own shared memory buffer which only that warp accesses.
First, each thread in a warp loads one 8-byte value from global memory into shared memory. Each thread in that warp then enters a loop in which it reads all the values from that shared array and uses them to update global memory via atomics. There is no possibility of warp divergence before this loop, and all 32 threads always participate in the updates.
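To make the structure concrete, here is a minimal sketch of the pattern (the names, buffer layout and the actual atomic update are my own placeholders, not the client code):

__global__ void warp_private_kernel(const float2* __restrict__ in, float* out)
{
    __shared__ float2 buf[8][32];                    // one buffer per warp, touched only by that warp

    const int lane          = threadIdx.x & 31;
    const int local_warp_id = threadIdx.x >> 5;      // 0-7 with 256 threads per block
    const int gid           = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: each thread loads one 8-byte value into its warp's shared buffer.
    buf[local_warp_id][lane] = in[gid];

    // No explicit synchronization here: the code relies on the 32 lanes of a
    // warp staying in step, since no other warp ever reads this buffer.

    // Step 2: each thread reads all 32 values and updates global memory via atomics.
    for (int i = 0; i < 32; ++i) {
        float2 v = buf[local_warp_id][i];
        atomicAdd(&out[lane], v.x + v.y);            // placeholder for the real atomic updates
    }
}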
The problem is that with CUDA 8.0 RC, running on a Pascal Titan X, the answers are wrong by a small amount, and that error varies from run to run.
With CUDA 7.5 I ran the code over 200 times on two different machines with Maxwell GPUs. On those machines there were never any issues, and the answers were always correct.
If I force the first thread in the warp (lane 0) to do all the shared memory updates and compile with CUDA 8.0 RC, the issue goes away, but the code is slower and the compiled kernel uses more registers.
It does not matter if I qualify the shared memory as ‘volatile’ in this case.
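For reference, the lane 0 workaround looks roughly like this (again my own reconstruction with placeholder names, not the client code):

    // Only lane 0 fills the warp's shared buffer; the other lanes do not write at all.
    if (lane == 0) {
        for (int i = 0; i < 32; ++i)
            buf[local_warp_id][i] = in[blockIdx.x * blockDim.x + local_warp_id * 32 + i];
    }
    // ...followed by the same read loop over the 32 entries as before.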
I know that warp-synchronous programming is discouraged, but for this kernel it is necessary.
I even used the following inline PTX after each thread in a warp updates shared memory, which did not make a difference:
// Emits a named bar.sync so a subset of the block (here, one warp) can synchronize.
__device__ __forceinline__ void __bar_sync(const unsigned int id, const unsigned int threads){
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(threads));
}
Each thread in a warp of 32 updates a float2 value in __shared__ memory
....
__bar_sync(local_warp_id, 32);
....
Each thread then enters a loop where they all read from those __shared__ values (read only).
Here local_warp_id is the thread number divided by 32 (so in this case 0-7). As I understand it, this bar.sync should force the warp to wait until all of its threads have arrived at the barrier. Please correct me if I am wrong in this assumption.
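Relative to the sketch earlier in the post, the barrier sits between the per-lane store and the read loop (placeholder names again):

    buf[local_warp_id][lane] = my_value;   // each lane writes its own float2
    __bar_sync(local_warp_id, 32);         // per-warp named barrier: id = local_warp_id, 32 threads
    for (int i = 0; i < 32; ++i) {
        float2 v = buf[local_warp_id][i];  // read-only pass over all 32 entries
        // ... atomic updates to global memory using v ...
    }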
I could post the SASS if that would help, but first I want to make sure that:
- I did not make any incorrect assumptions about warp behavior.
- If I did make a mistake, why does the implementation always seem to return the correct answers with CUDA 7.5?