I am devising a kernel that requires vector rotation. More specifically, I am looking for an efficient way to do the following (simplified senario):
Each thread has the same copy of a vector:
v0, v1, v2, …, v63
A rotation rotates these registers within the thread:
v63, v0, v1, …, v62
Let’s say variables a0 to a63 hold v0 to v63. Then
a0 = v0, a1 = v1, …, a63 = v63
becomes
a0 = v63, a1 = v0, …, a63 = v62.
So the algorithm works as follows:
(Each thread has the same vector but different data)
- Read in v0 to v63 as a0 to a63.
- Do some calculation with existing data.
- Rotate a, then return to step 2.
There will be 63 rotations, because the 64th rotation is just the original vector. I don’t think unrolling the loop for 64 times is a good choice. The main calculation is non-trivial and the resulting code would be too long.
This pseudocode shows how I currently get this done, through shared memory:
// suppose blockDim = 64
__shared__ float v[64];
// coalesced read from global memory
v[tid] = vec[tid];
__syncthreads();
float a[64];
for rot = 0 to 63:
// broadcast from shared memory
for i = 0 to 63:
a[i] = v[i];
// do some really complicated operations with data and a
// here += is only for demostration
for i = 0 to 63:
data[i] += a[i];
__syncthreads();
// each block reads in v with an offset, no bank conflict
int index = (tid + 63) % 64;
float x = v[index];
__syncthreads();
// then writes back
v[tid] = x;
__syncthreads();
Question: Is it possible to rotate more efficiently?
–
Actually, to reduce register pressure, a vector can even be split to 2 or even more threads.
For example, two neighboring threads have the same copy of a vector (v0…v63):
thread 0: v0 v1 v2 … v31
thread 1: v32 v33 … v63
thread 2: v0 v1 v2 … v31
thread 3: v32 v33 … v63
Rotate registers among two threads:
thread 0: v63 v0 … v30
thread 1: v31 v32 … v62
thread 2: v63 v0 … v30
thread 3: v31 v32 … v62
And in my application, the vector dimension is around 128, and the rotation is a varied one. I use Maxwell cards.