Here is a problem I have come across, and while I have a decent solution which works, I was wondering if there is a better way.
Imagine a large grid, for example 1200x2000 (stored in 1D contiguous memory). For each thread I essentially need to perform an atomic update (an atomicAdd() for example) on an adjacent 2x2 region: (x,y), (x+1,y), (x,y+1), (x+1,y+1). The values added to each of the four locations in that 2x2 area will all be different, based on some other input values loaded into shared memory.
From values loaded from an input table I will get the base output (x,y) location for a given thread, and then I need to update that region (or the portion of that region which is in bounds) with a corresponding set of different values.
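For concreteness, here is a host-side C++ sketch of the bounds-clipped 2x2 update described above. The names (`scatter2x2`, the `v[4]` value layout) are my own assumptions, not from the post; in the actual kernel, `buf` would be a pointer into global memory and the `fetch_add` calls would be `atomicAdd()`:

```cpp
#include <atomic>
#include <vector>

constexpr int W = 1200, H = 2000;  // grid dimensions from the post

// Add four different values to the 2x2 region whose base is (x,y),
// skipping any of the four locations that fall outside the grid.
// Host-side analogue: std::atomic stands in for CUDA atomicAdd().
void scatter2x2(std::vector<std::atomic<int>>& buf,
                int x, int y, const int v[4]) {
    for (int dy = 0; dy < 2; ++dy) {
        for (int dx = 0; dx < 2; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx < 0 || xx >= W || yy < 0 || yy >= H) continue;
            // row-major 1D index into the contiguous buffer
            buf[yy * W + xx].fetch_add(v[dy * 2 + dx]);
        }
    }
}
```

Each thread performs up to four atomic read-modify-write operations, and neighboring threads' 2x2 footprints may overlap, which is why the updates must be atomic at all.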
Unfortunately it is not feasible to do this in reverse, where I could examine a pixel location in the output buffer and then derive “what values go here?”.
The input data set can vary quite a bit, and the input data per thread block will map to writes covering a rectangular region whose size ranges from (0,0) up to, say, (336,32). Because the upper bound of that write region can be large, using shared memory as a temporary scratch pad is complicated and turns out no faster than my current implementation.
Another problem is that the number of memory updates per thread block varies by a large amount, again depending on somewhat random, real-time-changing input data.
I have engineered this enough that all the writes performed by a thread block land in the same 'general' rectangular region, but those writes are not sufficiently coalesced, because the stride between consecutive base locations can be larger than 2 elements in either direction (x and y).
The current implementation achieves acceptable memory bandwidth, but there is plenty of room for improvement.
I can't imagine that I am the first to come across such a problem, so maybe someone can suggest a better approach?