Efficient use of Array of Structures

I have an incoming stream of structs/objects that need to be batch-processed on the GPU, where each kernel thread uses one of those structs to create a different struct as output. The simplistic solution would be something like this:

__global__ void kernel( const InputType* dev_inputs, OutputType* dev_outputs, int len )
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if ( id >= len )   // >= so the thread at id == len doesn't read out of bounds
        return;

    InputType In = dev_inputs[id];
    OutputType Out;

    // using member variables from In to fill the member variables of Out here

    dev_outputs[id] = Out;
}

In the calling code, the input array has been loaded, batched, and the current batch cudaMemcpy’ed to dev_inputs. The dev_outputs array was cudaMalloc’ed but left empty. Both InputType and OutputType have copy constructors that copy each member variable in order.
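For context, a minimal sketch of that host-side setup for one batch; names like host_batch, host_outputs, and batch_len are placeholders, and error checking is omitted:

// Hypothetical host-side setup for one batch.
InputType*  dev_inputs  = nullptr;
OutputType* dev_outputs = nullptr;
cudaMalloc( (void**)&dev_inputs,  batch_len * sizeof(InputType)  );
cudaMalloc( (void**)&dev_outputs, batch_len * sizeof(OutputType) );   // allocated but left empty

// Copy the current batch of inputs to the device.
cudaMemcpy( dev_inputs, host_batch, batch_len * sizeof(InputType), cudaMemcpyHostToDevice );

// One thread per struct.
const int threads = 256;
const int blocks  = ( batch_len + threads - 1 ) / threads;
kernel<<<blocks, threads>>>( dev_inputs, dev_outputs, batch_len );

// Retrieve the results for this batch.
cudaMemcpy( host_outputs, dev_outputs, batch_len * sizeof(OutputType), cudaMemcpyDeviceToHost );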

It’s my understanding that this will work; however, the accesses to the member variables in the copy constructor cannot be coalesced. This leads to a performance hit, as each thread in a warp loads each member variable with its own separate memory transaction. The same thing would happen in the kernel if, instead of copying the whole struct, the individual member variables were accessed and stored locally:

auto first_var = dev_inputs[id].first_var;
auto second_var = dev_inputs[id].second_var;
// etc...

Basically, since each thread in the warp accesses “first_var” in a different struct, and those variables are not adjacent in memory, the accesses cannot be coalesced into a single load, and the warp stalls repeatedly as a result.
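To put rough numbers on it (using the 80-byte input struct described further down): 32 threads each reading an 8-byte “first_var” at an 80-byte stride touch 32 × 80 = 2560 bytes of address space, which spans about 20 separate 128-byte cache lines, instead of the 32 × 8 = 256 bytes (two cache lines) a fully coalesced load would need. That’s roughly 10× the memory traffic per member, if I’ve done the math right.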

So… What to do about this?

The two solutions I can find are: 1) split the data into separate arrays, one per member variable, so that all the “first_vars” are adjacent (see the sketch below). However, my particular case is a streaming environment, where transposing the massive amount of data as it flows in (and transposing it back on the way out) would itself be a major performance hit.

And 2) use the __shfl intrinsic to let the threads read adjacent memory addresses, even though those addresses do not hold their own data, and then “shuffle” the values between threads to get each piece into the right place. At least, that’s my waving-hands-at-it-vaguely interpretation; I’m still a little lost as to how it actually applies to my situation.
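To make option 1 concrete, here is a minimal sketch of what the split, structure-of-arrays layout might look like; the member names (first_var, second_var) are hypothetical and only a couple of members are shown:

// Hypothetical SoA version of option 1: one device array per member variable.
struct InputArrays
{
    double* first_var;    // device array of all first_vars, length = len
    double* second_var;   // device array of all second_vars, length = len
    // ... one pointer per member ...
};

__global__ void kernel_soa( InputArrays in, double* out_first, int len )
{
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if ( id >= len )
        return;

    // Adjacent threads read adjacent elements of each array, so each load is coalesced.
    double first  = in.first_var[id];
    double second = in.second_var[id];

    // ... compute the output members from the inputs here ...
    out_first[id] = first + second;   // placeholder computation
}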

Thankfully, there is a library, Trove, that supposedly does the option-2 shuffling for you. However, its documentation is paltry, the sample code is overly simplistic, and I can’t find examples of anyone using it in actual projects. In short, it’s no more usable for me than the __shfl intrinsic itself (which I did try to work out how to apply to structs).

Maybe there’s a third option out there I haven’t found. Maybe someone has used Trove and can explain its application to a problem such as mine. Maybe I’ll have to eat the performance hit from non-coalesced accesses or transposed inputs and outputs.

Any insight into all this would be appreciated. :)

What is sizeof(InputType) and sizeof(OutputType)?

For my particular case, the input struct is 80 bytes (nine doubles and two ints) and the output is 24 bytes (three doubles).
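For reference, the layouts would look something like this; the member names are placeholders, and only the sizes come from my actual types:

// Hypothetical layouts matching the sizes above; real member names differ.
struct InputType       // 9 * 8 + 2 * 4 = 80 bytes
{
    double d0, d1, d2, d3, d4, d5, d6, d7, d8;
    int    i0, i1;
};

struct OutputType      // 3 * 8 = 24 bytes
{
    double r0, r1, r2;
};

static_assert( sizeof(InputType)  == 80, "expected 80-byte input struct" );
static_assert( sizeof(OutputType) == 24, "expected 24-byte output struct" );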

I’m guessing you’re asking because whether __shfl beats non-coalesced access depends on the size of the data? I’ve seen graphs indicating that __shfl becomes more efficient for larger structures, and at 80 bytes the InputType here is off the end of the chart, suggesting __shfl should be much faster.