Non-coalesced read/write in global vs shared

I have to implement an algorithm which shifts a 2D patch of complex data in an arbitrary fashion.

I could just read the data from global memory in said (computed) arbitrary order and write it back coalesced.

Or read it coalesced and write it back in arbitrary order.

Or I could read it coalesced from global into shared memory, access it un-coalesced in shared, and write it back coalesced to global memory.

Any thoughts/experience on which would be faster?

Or, between un-coalesced global reads and un-coalesced global writes, which is faster?

Is buffering through shared memory faster than un-coalesced global reads/writes?

Both cuBLAS and my GEMM/convolution kernels shuffle things around in shared memory prior to writing out to global. If you can do it in a way such that the shuffle addresses all stay within the same region for each warp, you don’t need __syncthreads(). Shared memory access is very fast compared to global, and the latencies can often be hidden by thread-level parallelism, so it’s most likely going to be faster for you, particularly if you can avoid the __syncthreads().
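As a rough illustration (not the actual cuBLAS code; the tile size and permutation here are made up), the per-warp pattern looks something like this:

```cpp
// Illustrative only: a made-up 32-entry permutation per warp, with each warp
// staging through its own slice of shared memory so no block-wide barrier is needed.
constexpr int WARPS_PER_BLOCK = 4;           // launch with blockDim.x == 128

__global__ void permute_per_warp(const float* __restrict__ in,
                                 float*       __restrict__ out,
                                 const int*   __restrict__ perm,  // 32 entries, values 0..31
                                 int n)
{
    __shared__ float stage[WARPS_PER_BLOCK][32];

    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    int base = blockIdx.x * blockDim.x + warp * 32;

    // Coalesced global read into this warp's slice (zero-pad past the end).
    stage[warp][lane] = (base + lane < n) ? in[base + lane] : 0.0f;

    // Only this warp ever touches stage[warp][..], so a warp-level barrier is
    // enough (needed for correctness on Volta and later; far cheaper than
    // __syncthreads()).
    __syncwarp();

    // The arbitrary-order access happens in shared memory, not in global memory.
    float v = stage[warp][perm[lane]];

    // Coalesced global write.
    if (base + lane < n)
        out[base + lane] = v;
}
```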

Thanks for your response, Scott.

If by “same region” you mean never crossing warp boundaries: since my patch size is 19x21 (399 points) and a warp is 32 threads, it seems that if a single thread processes 4x4 (or 8x2) data points, all of the shared memory can be processed by one warp (32 × 16 = 512 ≥ 399).

I’m not sure if there is any other way to do what you referenced?

I meant that you don’t have one warp writing to a piece of memory that another warp is reading from. Within a warp you can shuffle things however you like. Another thing to look at is the warp shuffle instruction. That can be quicker than using shared memory but isn’t as flexible.
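For example (a rough sketch, not taken from any real kernel), shifting one value per lane to the right by `shift` positions and zero-filling the vacated lanes:

```cpp
// Sketch only: each lane holds one float; the warp shifts the row right by
// `shift` lanes and zero-fills the lanes that have no source.
__device__ float shift_right_in_warp(float v, int shift)
{
    int lane = threadIdx.x & 31;

    // Each lane pulls the value held by the lane `shift` positions to its left.
    float pulled = __shfl_up_sync(0xffffffffu, v, shift);

    // Lanes 0..shift-1 have no source to their left; zero-fill them.
    return (lane < shift) ? 0.0f : pulled;
}
```

The equivalent left shift is the same thing with __shfl_down_sync().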

Yeah, that’s what I meant by (a warp) crossing the boundaries (for the data it reads).

IOW, writing outside of the region it reads. I don’t know how to ensure that doesn’t happen when shifting data, other than keeping all the data accesses within a single warp.

I’ve considered that (briefly) but it seems much more complicated.

I just don’t know enough about how warps are scheduled.

Examples I’ve seen don’t help much.

I illustrate the technique I use here (which I borrowed from cublas):

As for scheduling, you have to assume that at any given clock any available warp can be swapped in and start running. There is zero cost to context switch a warp.

In general, an un-coalesced write to global memory can be cheaper than an un-coalesced read, since writes can be cached and the thread does not have to wait for them to complete, while a read stalls until the data arrives.

Non-coalesced reads/writes in shared memory have no negative effect on your performance; just watch out for bank conflicts there. Since you did not explain in what way your data is shifted, I’ll take the example of a matrix transpose. You divide the matrix into tiles, each handled by one CUDA block. Every block reads coalesced, swaps the indices in shared memory, and writes back coalesced. That’s the fastest version you can get. Non-coalesced global reads/writes will always ruin your performance, so if you can avoid them in any way, do it.
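For reference, the standard tiled transpose looks roughly like this (illustrative sizes; the +1 padding is what avoids the bank conflicts mentioned above):

```cpp
// Illustrative tiled transpose: launch with blockDim = (32, 32) and a grid that
// covers the matrix. The +1 column of padding avoids shared-memory bank
// conflicts on the transposed (column-wise) access.
#define TILE 32

__global__ void transpose_tiled(const float* __restrict__ in,
                                float*       __restrict__ out,
                                int width, int height)           // `in` is height x width
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced read

    __syncthreads();

    // Swap the block coordinates; the per-thread coordinates stay row-major, so
    // the global write below is still coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];    // coalesced write
}
```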

As for shuffle being too complicated: you need to decide whether you want to write a fast program or just one that works; if it’s the latter, you could also use a CPU. Shuffle is probably the most efficient mechanism you can use at the moment.

The data is shifted either left, right, up or down, with some of the shifted edge values set to zero.
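Concretely, what I need is roughly this (a sketch with made-up names; the real data is complex, hence float2):

```cpp
// Sketch only: the whole 19x21 patch is staged through shared memory by a single
// block, read coalesced, then written back coalesced with the shift applied and
// zeros filled in along the edge that was shifted in.
#define ROWS 19
#define COLS 21

__global__ void shift_patch(const float2* __restrict__ in,
                            float2*       __restrict__ out,
                            int shift_x, int shift_y)            // e.g. +1 = one step right/down
{
    __shared__ float2 patch[ROWS][COLS];

    // Coalesced read of the whole patch into shared memory.
    for (int i = threadIdx.x; i < ROWS * COLS; i += blockDim.x)
        patch[i / COLS][i % COLS] = in[i];
    __syncthreads();

    // Coalesced write: each output element fetches its (shifted) source from
    // shared memory, or zero if the source falls outside the patch.
    for (int i = threadIdx.x; i < ROWS * COLS; i += blockDim.x) {
        int r = i / COLS, c = i % COLS;
        int src_r = r - shift_y, src_c = c - shift_x;
        float2 v = make_float2(0.0f, 0.0f);
        if (src_r >= 0 && src_r < ROWS && src_c >= 0 && src_c < COLS)
            v = patch[src_r][src_c];
        out[i] = v;
    }
}
```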

The CPU is usually going to be slower than even a less-than-optimal CUDA program.

IME, using shared memory is only ~20% slower than using shfl. If using shfl takes me much longer to write the code, my boss won’t be very happy.

So I take it you believe that using shared memory to maintain coalesced global reads/writes will always be faster?

I’ve begun to write it that way, but have modified my initial code to use only 1 warp (32 threads) per block instead of 1 thread per data point, so I don’t have to use __syncthreads.

32 threads per block is generally not how you write fast code.

It will not always be faster; as I mentioned above, it depends on how the data needs to be shifted. If you need to place your data at totally random locations when you store it, then you’ll have trouble writing a kernel that avoids non-coalesced global writes.

Do you think it would be faster to use more threads (fewer data points per thread) and use __syncthreads? Not much processing is being done; pretty much read/write.

Thanks for the pointer. Looks interesting.

I wouldn’t be too concerned with __syncthreads; you’ll likely see a speedup with or without it. Though as in all things, it’s best to run some tests to accurately measure the differences.

As for running with 32 threads per block: if each warp is independent of the others, there’s not much to gain from larger block sizes, particularly on Maxwell. But if you can effectively share data between warps to reduce device memory bandwidth or compute, then you should use bigger blocks.