Cyclically rotate among registers within threads

I am devising a kernel that requires vector rotation. More specifically, I am looking for an efficient way to do the following (simplified scenario):

Each thread has the same copy of a vector:
v0, v1, v2, …, v63

A rotation rotates these registers within the thread:
v63, v0, v1, …, v62

Let’s say variables a0 to a63 hold v0 to v63. Then
a0 = v0, a1 = v1, …, a63 = v63
becomes
a0 = v63, a1 = v0, …, a63 = v62.

So the algorithm works as follows:
(Each thread has the same vector but different data)

  1. Read in v0 to v63 as a0 to a63.
  2. Do some calculation with existing data.
  3. Rotate a, then return to step 2.

There will be 63 rotations, because the 64th rotation just restores the original vector. I don't think unrolling the loop 64 times is a good choice: the main calculation is non-trivial, and the resulting code would be too long.

This (slightly simplified) code shows how I currently get this done, through shared memory:

// suppose blockDim.x == 64
__shared__ float v[64];

int tid = threadIdx.x;

// coalesced read from global memory
v[tid] = vec[tid];
__syncthreads();

// data[] is the per-thread state, declared and filled elsewhere
float a[64];
for (int rot = 0; rot < 64; ++rot)
{
    // broadcast from shared memory into per-thread registers
    for (int i = 0; i < 64; ++i)
        a[i] = v[i];

    // do some really complicated operations with data and a
    // (+= is only for demonstration)
    for (int i = 0; i < 64; ++i)
        data[i] += a[i];

    __syncthreads();

    // each thread reads v at an offset of one; no bank conflicts
    int index = (tid + 63) % 64;
    float x = v[index];

    __syncthreads();

    // then writes the rotated element back
    v[tid] = x;

    __syncthreads();
}

Question: Is it possible to rotate more efficiently?

Actually, to reduce register pressure, the vector can even be split across two or more threads.

For example, each pair of neighboring threads cooperatively holds one copy of the vector (v0…v63):

thread 0: v0 v1 v2 … v31
thread 1: v32 v33 … v63
thread 2: v0 v1 v2 … v31
thread 3: v32 v33 … v63

Rotating the registers across each pair of threads gives:

thread 0: v63 v0 … v30
thread 1: v31 v32 … v62
thread 2: v63 v0 … v30
thread 3: v31 v32 … v62
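
For illustration, here is a rough, untested sketch of how this pair-wise rotation might be done with a single warp shuffle instead of shared memory. It assumes CUDA 9+'s __shfl_xor_sync (older toolkits would use the pre-Volta __shfl_xor), and it relies on full unrolling so the indexed array actually stays in registers:

// hypothetical sketch: two adjacent lanes hold 32 elements each of one
// 64-element vector; a single XOR shuffle moves the boundary element
__device__ void rotate_pair(float (&r)[32])
{
    // lane 2k exchanges with lane 2k+1: each lane receives its partner's
    // last element, which becomes its own new first element
    float incoming = __shfl_xor_sync(0xffffffffu, r[31], 1);

    #pragma unroll
    for (int i = 31; i > 0; --i)   // shift the rest up within the lane
        r[i] = r[i - 1];
    r[0] = incoming;
}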

In my actual application the vector dimension is around 128, and the rotation amount varies. I am using Maxwell cards.

“Each thread has the same copy of a vector:”

why?

“(Each thread has the same vector but different data)”

why?

To be clearer:

Each thread reads in different data but the same a.

data[i] += a[i];

Here data is the data, which is different for each thread.
a is the vector, which is broadcast across the block (or the grid), and then rotated.

Or, if data is not an array, then for each thread we might have something like

data += a[0] * a[1] - a[2] * a[3] / a[4] … a[63];

then we rotate a, then do the same calculation on data.

This is the algorithmic pattern used for solving some NP-hard mathematical problem.

i am not questioning “the algorithmic pattern used for solving some NP-hard mathematical problem”
i am however questioning the computational redundancy

if all threads use the same input a, but with different data, then why broadcast it to all threads?
why not simply keep a in shared memory, and have each thread access it (what it needs) directly from shared memory?

data[i] += a[i];

here, i would argue that a should be in shared memory, and that there is no need for each thread to keep a copy

also, if the rotated a is identical for all threads, why have each thread rotate a?
again, i would rotate a in shared memory, and have each thread access what it needs from shared memory

and with

data += a[0] * a[1] - a[2] * a[3] / a[4] … a[63];

if only data differs per thread, with a[0] * a[1] - a[2] * a[3] / a[4] … a[63] being the same for all threads, i would first solve a[0] * a[1] - a[2] * a[3] / a[4] … a[63], such that

data += a_0to63_solved;

more specifically, i would attack the patterns in a[0] * a[1] - a[2] * a[3] / a[4] … a[63], such that multiple threads can work on it together, to solve it
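
a minimal sketch of what i mean; solve_expression() is a hypothetical stand-in for however the common sub-expression is evaluated (ideally itself by a parallel reduction):

__shared__ float a[64];
__shared__ float a_solved;

// evaluate the thread-invariant expression once per block
if (threadIdx.x == 0)
    a_solved = solve_expression(a);
__syncthreads();

// every thread then reuses the same scalar
data += a_solved;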

Sorry for my inability to express myself clearly…

Actually data and a interact closely. The operation I need is like:

q = data[0]*a[0] + … + data[63]*a[63]; // dot product of data and a
q = f(q); // some function that post-processes q
data[0] += q * a[0], … data[63] += q * a[63]; // vector scalar multiplication and addition
rotate a, and back to first line

It seems a bad idea to read a twice (in the 1st and 3rd lines), but keeping a in registers also causes severe spilling (each thread would need to hold 64 data values and 64 a values).

const int e = 64;   // number of elements (the vector is v0..v63)

__shared__ float a[64];
__shared__ float b[64];
__shared__ float data[64];

// at this point, data[] and a[] are taken to be already read (from global memory)

if (threadIdx.x < e)
{
    b[threadIdx.x] = data[threadIdx.x] * a[threadIdx.x];   // per-element partial products
}

__syncthreads();

// now, multiple threads participate in a sum reduction/scan of b, to obtain the dot product
// i am not going to insert the sum reduction/scan code here

// calculate f_q = f(q) now

if (threadIdx.x < e)
{
    data[threadIdx.x] += f_q * a[threadIdx.x];
}

__syncthreads();

// let's rotate a[]

float a_;

if (threadIdx.x == 0)
{
    a_ = a[e - 1];   // element 0 wraps around to receive the last element
}

if ((threadIdx.x < e) && (threadIdx.x > 0))
{
    a_ = a[threadIdx.x - 1];   // everyone else receives their left neighbour
}

__syncthreads();

if (threadIdx.x < e)
{
    a[threadIdx.x] = a_;
}
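
for reference, one standard shape the omitted reduction could take: a shared-memory tree over the 64 partial products in b[] (this assumes blockDim.x == 64, a power of two, and that b[] has already been filled and synced, as above):

for (int stride = 32; stride > 0; stride >>= 1)
{
    if (threadIdx.x < stride)
        b[threadIdx.x] += b[threadIdx.x + stride];
    __syncthreads();   // reached by all threads; only the add is guarded
}

float q   = b[0];   // every thread can now read the dot product
float f_q = f(q);   // f() being the post-processing function you described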

Thanks. This method spreads data across threads, and is what I first came up with. My concern is that the block reduction/scan would be too expensive. There are lots of redundant calculations during parallel reduction. That is why I lay out the data the way I do. Or maybe this is still a win?

I am also thinking about how to deal with my target dimension 130. If the block dimension is 128, I can't read the vector in a single load per thread. If the block dimension is 256, then we waste bandwidth.

Each thread holding 130 elements (my layout) and 130 threads each holding one element (your layout) are the two extremes. What about something in between?

Therefore, I am assessing the viability of 4 threads cooperatively holding 130 elements (33 elements each), or 8 threads (17 each).
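
As a thought experiment, here is a hypothetical, untested sketch of what the rotation for such an in-between layout might look like, generalizing the pair-wise shuffle sketch above: G adjacent lanes each hold E registers, with G a power of two so a group never straddles a warp boundary. rotate_group, G, and E are illustrative names, and handling the padding slots when G * E exceeds 130 is a wrinkle this sketch ignores:

template <int G, int E>
__device__ void rotate_group(float (&r)[E])
{
    unsigned lane = threadIdx.x & 31u;         // lane within the warp
    unsigned base = lane & ~unsigned(G - 1);   // first lane of this group
    unsigned sub  = lane - base;               // position within the group
    unsigned src  = base + (sub + G - 1) % G;  // previous lane, wrapped

    // the predecessor lane's last register becomes this lane's first
    float incoming = __shfl_sync(0xffffffffu, r[E - 1], src);

    #pragma unroll
    for (int i = E - 1; i > 0; --i)            // shift up within the lane
        r[i] = r[i - 1];
    r[0] = incoming;
}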

“My concern is that the block reduction/scan would be too expensive.”

i do not see how it would be more expensive than individual threads calculating the sums on their own

“There are lots of redundant calculations during parallel reduction”

may i respectfully disagree; and may you kindly explain this to me

“I am also thinking about how to deal with my target dimension 130”

use 5 warps, i.e. 32 * 5 = 160 threads
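
e.g., a guarded read along these lines, reusing the vec/v names from your earlier pseudocode (the last 30 of the 160 threads simply sit the read out):

// guarded load for a 130-element vector with a 160-thread (5-warp) block
if (threadIdx.x < 130)
    v[threadIdx.x] = vec[threadIdx.x];
__syncthreads();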

“If the block dimension is 256, then we waste bandwidth”

how?

“Each thread holding 130 elements…”

when threads hold the same data, and the data changes, all threads must either discard the data and re-read it, or individually recalculate the data
if the data is identical across the threads, i perceive this to be (grossly) redundant
i would also perceive threads holding the same data as (grossly) unnecessary register pressure