I have found a situation where a __threadfence_block() call is required, but only on some cards. More confusingly, it is my understanding that the __threadfence_block() call should not be required on any card.
The code listed below has been tested and always works properly on both a 780M and a 1080 (without the call to __threadfence_block()). However, it never works correctly without that call on a 970M (tested with CUDA 7.5 on Ubuntu 15.10).
The kernel performs a modified 2-Nearest Neighbors query (under the Hamming norm) for 2 sets of bitvectors of length 512 (this is useful in computer vision). The error occurs at the top of a partially unrolled loop. In the loop, a const uint64_t value is read from global memory, XOR’d against several const register uint64_t values, and each result is passed to __popcll(). The results are then packed and reduced with __shfl_xor().
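For reference, the per-word distance computation described above is just the Hamming distance between 512-bit descriptors stored as eight 64-bit words. A minimal host-side C++ sketch of the same idea (this is not part of the kernel; hamming512 is a hypothetical helper name, and __builtin_popcountll is the GCC/Clang analogue of CUDA's __popcll):

```cpp
#include <cstdint>

// Hamming distance between two 512-bit descriptors, each stored as
// eight 64-bit words: XOR corresponding words and count the set bits.
static int hamming512(const uint64_t* a, const uint64_t* b) {
    int dist = 0;
    for (int i = 0; i < 8; ++i)
        dist += __builtin_popcountll(a[i] ^ b[i]); // same role as __popcll() in the kernel
    return dist;
}
```

The kernel computes these popcounts one word at a time and accumulates them across lanes with warp shuffles instead of a loop.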
On the 780M (tested on Windows) and the 1080 (tested on Linux) this kernel always finds all matches correctly; on my 970M it never finds any matches… unless a __threadfence_block() is added.
This __threadfence_block() may be placed anywhere before the final if statement, even before the read itself. This seems very strange to me. Adding calls to __threadfence_block() before and after the loop does not change this behavior.
What is stranger still is that I added a call to CUDA’s printf() in the kernel, wrapped by an if statement that selected only thread 0 in warp 0 in block 0. This also fixed the problem, for every block, without a __threadfence_block().
The __threadfence_block() call only lowers performance by ~8%, but this is performance-critical code in an O(n^2) task meant to be released as a library, so even that matters.
Why is this happening? Is this a bug?
Full project (just run “make run”): https://drive.google.com/file/d/0B40zIKz22S79M1ljd21FZ2Y0eDA/view?usp=sharing (if you see a very small number of matches, lower the threshold defined at the start of main.cpp, then “make clean; make run”)
__global__ void CUDAMATCH_kernel(const uint64_t* const __restrict__ g_query, const int num_q,
                                 const uint64_t* const __restrict__ g_training, const int num_t,
                                 int* const __restrict__ g_match, const int threshold) {
    register uint32_t offset = ((threadIdx.x & 24) << 3) + (threadIdx.x & 7) + (blockIdx.x << 11) + (threadIdx.y << 8);
    const register uint64_t qa = g_query[offset];
    const register uint64_t qb = g_query[offset + 8];
    const register uint64_t qc = g_query[offset + 16];
    const register uint64_t qd = g_query[offset + 24];
    const register uint64_t qe = g_query[offset + 32];
    const register uint64_t qf = g_query[offset + 40];
    const register uint64_t qg = g_query[offset + 48];
    const register uint64_t qh = g_query[offset + 56];
    register int best_i = -1;
    register int best_v = 100000;
    register int second_v = 200000;
#pragma unroll 7
    for (int t = 0; t < num_t; ++t) {
        // __threadfence_block(); // Adding this call before the read fixes things.
        const register uint64_t train = g_training[(t << 3) + (threadIdx.x & 7)];
        register int dist0 = __popcll(qa ^ train) | ((__popcll(qe ^ train)) << 16);
        register int dist1 = __popcll(qb ^ train) | ((__popcll(qf ^ train)) << 16);
        register int dist2 = __popcll(qc ^ train) | ((__popcll(qg ^ train)) << 16);
        register int dist3 = __popcll(qd ^ train) | ((__popcll(qh ^ train)) << 16);
        dist0 += __shfl_xor(dist0, 1);
        dist1 += __shfl_xor(dist1, 1);
        if (threadIdx.x & 1) dist0 = dist1;
        dist2 += __shfl_xor(dist2, 1);
        dist3 += __shfl_xor(dist3, 1);
        if (threadIdx.x & 1) dist2 = dist3;
        dist0 += __shfl_xor(dist0, 2);
        dist2 += __shfl_xor(dist2, 2);
        if (threadIdx.x & 2) dist0 = dist2;
        dist0 = ((dist0 + __shfl_xor(dist0, 4)) >> ((threadIdx.x & 4) << 2)) & (((4 ^ (threadIdx.x & 4)) << 9) - 1);
        if (dist0 < second_v) second_v = dist0;
        // __threadfence_block(); // Can also add it here, or anywhere in between.
        if (dist0 < best_v) {
            second_v = best_v;
            best_v = dist0;
            best_i = t;
        }
    }
    if (second_v - best_v <= threshold) best_i = -1;
    const register int idx = (blockIdx.x << 8) + (threadIdx.y << 5) + threadIdx.x;
    if (idx < num_q) g_match[idx] = best_i;
}
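For clarity on the dist0..dist3 packing in the loop above: each 32-bit lane value carries two 16-bit distances (a popcount of 512 bits is at most 512, and at most 8 of them are summed, so neither half can overflow into the other), so one integer add reduces two distances at once. A minimal host-side sketch of just the packing idea (pack2/lo16/hi16 are hypothetical names; this does not emulate the warp shuffles themselves):

```cpp
#include <cstdint>

// Pack two sub-16-bit distances into one 32-bit value, low | high.
static uint32_t pack2(uint32_t lo, uint32_t hi) { return lo | (hi << 16); }

// Because neither half can carry into the other, adding two packed
// values sums both halves with a single integer add.
static uint32_t lo16(uint32_t p) { return p & 0xFFFF; }
static uint32_t hi16(uint32_t p) { return p >> 16; }
```

In the kernel, the final shift-and-mask on dist0 plays the role of lo16/hi16, selecting one 16-bit half per lane.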