Bank conflict hell: could a shuffle-based sort help?

I have an array of unsigned 16-bit integers stored in a 3D array of size (2048, 2560, 16). For each group of 16 (along the minor axis) I want the median. I have been doing this by sorting each chunk of 16 and taking the middle value.

Another way to say this: I want to sort 16 values, each 16 bits in size, and then grab the middle.

Rather than use a block size of 16, I group the blocks as (32x16), i.e. <<32,16>>, to maximize occupancy.

Data is stored in shared memory, and that is leading to what I believe is a shared-memory bank conflict.

medianGroup = 32;
medianWarp = 16;

ushort[,] window = thread.AllocateShared<ushort>("window", medianGroup, medianWarp);

In truth it is easier to take the median of 15 values, so only 15 of the 16 are used, and I only need to sort half the values to get the median. For this reason (and for contractual reasons) I am not using libraries such as Thrust, CUB, etc.

I also found that a simple selection sort worked, but it hits a nasty bank conflict when it does the swap.

The code snippet is below with comments. I also tried to do the same thing with a warp shuffle that repeatedly selects the minimum. That should help, but I have a bug in it that escapes me.

The code is in C#, since it is more succinct that way, but I can also provide the CUDA-C.

selection sort:

//frames is the raw 3d array, window is the shared memory array that I will operate over

ushort[,] window = thread.AllocateShared<ushort>("window", medianGroup, medianWarp);
window[group, d] = frames[y, x, d];  // d = thread.x, group = thread.y (x and y come from block.x and block.y)



if (d == 0) // all threads in the warp load, but only one needs to do the sort
{
    for (int j = 0; j < medianWarp / 2; j++) // only the lower half needs sorting to reach the median
    {
        int min = j;
        for (int l = j + 1; l < frameStackSize; l++) // frameStackSize is 15
        {
            if (window[group, l] < window[group, min])
                min = l;
        }
        ushort temp = window[group, j];
        window[group, j] = window[group, min]; // this swap is where the bank conflict bites
        window[group, min] = temp;
    }
}

OK, I know you are groaning about the selection sort; so was I. But it was quick. So I tried to do the same thing with CUDA ShuffleDown, but my code is broken. This is a better solution since it can use all threads in the warp, not just lane #0 (with the other lanes doing nothing but the initial load).

// me is a ushort; d is the lane id in the warp (thread.x)
// the trick is to pad the warp with big values (65535) at the end after the 15th value, so that there is enough to shuffle.

for (int j = 0; j < medianWarp / 2; j++)
{
    me = ((d + j) < medianWarp) ? window[group, d + j] : (ushort)65535;
    //thread.SyncThreads();
    for (uint i = 1; i < medianWarp; i *= 2)
    {
        n = (ushort)thread.ShuffleDown(me, i, medianWarp);
        me = (n < me) ? n : me;
    }
    if (d == j)
    {
        window[group, j] = me; // this is also causing a bank conflict. But I could use a different array to avoid that
    }
}
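
(For reference, two things appear to go wrong in the pass above: the shuffle-down min-reduction deposits the running minimum in lane 0, not lane j, so the "if (d == j)" write stores the wrong lane's value; and writing the minimum straight over window[group, j] loses the element that was there, since nothing moves it to the minimum's old slot. A corrected sketch in CUDA-C follows: hypothetical function name, assuming d is the lane within a 16-wide group, pre-Volta warp-synchronous execution as on the K40/GTX980, and CC 3.0+ for __shfl_down; it also pads from the 15th element onward, so the unused 16th slot need not be pre-filled.)

// Corrected sketch: each pass min-reduces the value AND its index across
// the 16 lanes, then lane 0 swaps the minimum into place.
__device__ void halfwarpSelectMedian(unsigned short window[][16],
                                     int group, int d, int frameStackSize)
{
    for (int j = 0; j < 8; j++) // 8 passes: the median of 15 is the 8th smallest
    {
        // lane d holds element j+d; pad past the last real element
        int me  = (d + j < frameStackSize) ? window[group][d + j] : 0xFFFF;
        int idx = d + j;        // remember where 'me' came from

        for (int i = 1; i < 16; i *= 2)
        {
            int n    = __shfl_down(me,  i, 16);
            int nIdx = __shfl_down(idx, i, 16);
            if (n < me) { me = n; idx = nIdx; }
        }

        if (d == 0)             // the reduction leaves the minimum in lane 0
        {
            unsigned short temp = window[group][j];
            window[group][j]    = (unsigned short)me; // swap rather than overwrite,
            window[group][idx]  = temp;               // so element j is not lost
        }
    }
    // window[group][7] now holds the median of the 15 values
}

Writing the selected values to a separate output array instead of back into window would also sidestep the bank conflict noted in the comment above.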

Personally I would use half-warp (16-lane) bitonic sorts using shfl_xor; then you don't need any shared memory, and each warp does two groups. Then you can use shfl to grab the two middle values and mean those together (or mean then grab). This does assume you have CC 3.0+ hardware, which is very likely.

How about doing each group of 16 within a single thread?
16 registers aren't really demanding, you have enough groups of 16 to load the entire GPU, you can select the most efficient algorithm available without having to worry about parallelization, you can load the data efficiently using ushort4, and you might even be able to squeeze a bit more out of it by loading via uint4 instead and throwing in some SIMD instructions if your GPU supports them natively.
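
To make that concrete, a rough skeleton of the one-thread-per-group idea in CUDA-C (all names hypothetical; assumes each group of 16 ushorts is consecutive in global memory and 8-byte aligned):

// Median of 15 by partial selection: after 8 passes the 8th smallest
// (the median) sits at index 7. v[15] is padding and must be >= all
// real values.
__device__ unsigned short median15(unsigned short v[16])
{
    for (int j = 0; j < 8; j++)
    {
        int m = j;
        for (int k = j + 1; k < 15; k++)
            if (v[k] < v[m]) m = k;
        unsigned short t = v[j]; v[j] = v[m]; v[m] = t;
    }
    return v[7];
}

// One thread per group of 16; no inter-thread communication needed.
__global__ void medianPerThread(const unsigned short *frames,
                                unsigned short *out, unsigned int groups)
{
    unsigned int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= groups) return;

    // four 8-byte vector loads pull in the whole group
    const ushort4 *p = reinterpret_cast<const ushort4 *>(frames + g * 16u);
    unsigned short v[16];
    for (int i = 0; i < 4; i++)
    {
        ushort4 q = p[i];
        v[4 * i + 0] = q.x; v[4 * i + 1] = q.y;
        v[4 * i + 2] = q.z; v[4 * i + 3] = q.w;
    }
    v[15] = 0xFFFF; // only 15 of the 16 values are real

    out[g] = median15(v);
}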

mwilkinson:
I have seen the paper on Kepler shuffle tips, but I can't decipher how the xor pairs are being done. I have read that article many times. Furthermore, I can't use the mean. I really do need the median, and that is for a set of 15, and my skill at creating a bitonic sort for 15 is not that good. (Tesla K40 and Maxwell GTX980 being used.)

tera:

Now that does sound like a clean approach. But can you be a bit more specific on the loading? I.e. take the simple case: if I have each thread loop at the start, loading a register array of size 16 with the 16 values (consecutive in global memory), will I still get coalesced memory reads? The Tesla K40 has a 384-byte DDR cache line, and I want to use that to the MAX!!! Sorry if I am a bit dense; I am not yet an expert at CUDA. Or should I use an L1/shared memory array? (Since all sorts are independent, I don't see any intra-block communication needed.) Let's say a block size of 512 (which would work out at 8K registers), and the K40 is supposed to have 65536 per block. Sounds good to me. Am I missing anything?

Oh yes, and how would adjacent warp threads work if I am looping the loading into registers? Of course I want the warp threads reading adjacent memory (the x-axis is the minor axis). But if they are each doing uint4 (two separate reads for 16 shorts), would they then coalesce?

You will need to rely on the cache for full memory access efficiency.
As you said, with uint4 you only need two reads, so if they are issued back-to-back the cache pressure should be minimal.
However, you will need additional instructions to unpack the values, so it might be faster to use 4 ushort4 reads and accept the increased cache pressure. If you want maximum performance, I'd just benchmark both versions.

If adjacent threads read adjacent groups of 16, the first uint4 read would only have 50% bandwidth efficiency, but the second read would be satisfied entirely from the cache, so 100% efficiency overall.
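
For illustration, the uint4 variant could look like this sketch (hypothetical names; assumes each group of 16 ushorts starts on a 16-byte boundary, and little-endian halfword order when unpacking):

// Two 16-byte loads issued back to back, then two ushorts unpacked from
// each 32-bit word.
__device__ void loadGroupUint4(const unsigned short *frames, unsigned int g,
                               unsigned short v[16])
{
    const uint4 *p = reinterpret_cast<const uint4 *>(frames + g * 16u);
    uint4 a = p[0];
    uint4 b = p[1]; // back-to-back: both requests can be in flight together

    unsigned int w[8] = { a.x, a.y, a.z, a.w, b.x, b.y, b.z, b.w };
    for (int i = 0; i < 8; i++)
    {
        v[2 * i + 0] = (unsigned short)(w[i] & 0xFFFFu); // low halfword first
        v[2 * i + 1] = (unsigned short)(w[i] >> 16);
    }
}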

Newbie question: is uint4 supported as a native 128-bit read, or is it just doing a single cache load with separate 32-bit reads? And are uint8s supported?

  1. can uint8 be done, is it worth it?: http://stackoverflow.com/questions/4189243/whats-a-good-alternative-to-uint8-t-when-it-is-not-provided-by-the-compiler
  2. is 128 load supported?: http://stackoverflow.com/questions/12340384/are-there-advantages-to-using-the-cuda-vector-types
  3. how do I cast the uint4 to ushort array ? http://stackoverflow.com/questions/13102205/how-to-properly-cast-a-global-memory-array-using-the-uint4-vector-in-cuda-to-inc

Also, I don't understand your statement about adjacent warp threads. I can see that since the stride between threads is 32 bytes, I should be able to get 12 threads into a single coalesced read, and the next read would be the same. So I see a total memory bandwidth of 50%, but at a constant rate. Did I make a mistake?

I was talking about the median: the median of an even number of values is the mean of the middle two. If you only have 15 values you can just put a dummy maxint in the 16th slot and then grab the 8th value for the median.

The 16-wide sort should look like:

__device__ int swap(int x, int mask, int dir)
{
    // exchange with the lane whose id differs in the bits of 'mask';
    // 'dir' picks whether this lane keeps the larger (1) or smaller (0) value
    int y = __shfl_xor(x, mask, 16);
    return (x < y) == dir ? y : x;
}
x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); //  2
x = swap(x, 0x02, bfe(laneid, 2) ^ bfe(laneid, 1)); //  4
x = swap(x, 0x01, bfe(laneid, 2) ^ bfe(laneid, 0));
x = swap(x, 0x04, bfe(laneid, 3) ^ bfe(laneid, 2)); //  8
x = swap(x, 0x02, bfe(laneid, 3) ^ bfe(laneid, 1));
x = swap(x, 0x01, bfe(laneid, 3) ^ bfe(laneid, 0));
x = swap(x, 0x08,                  bfe(laneid, 3)); // 16
x = swap(x, 0x04,                  bfe(laneid, 2));
x = swap(x, 0x02,                  bfe(laneid, 1));
x = swap(x, 0x01,                  bfe(laneid, 0));
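
Assuming the ten stages above leave the values in ascending lane order (which the dir pattern is constructed to do), grabbing the median of 15 is then a thin wrapper; a sketch reusing the swap() and bfe() helpers from this thread:

// Pad lane 15 with a dummy max, run the network, then lane 7 holds the
// 8th smallest of the 15 real values, i.e. the median.
__device__ int median15Bitonic(int myValue, int laneid)
{
    int d = laneid & 15;                  // lane within the 16-wide group
    int x = (d < 15) ? myValue : 0xFFFF;  // dummy max for the 16th slot
    x = swap(x, 0x01, bfe(d, 1) ^ bfe(d, 0));
    x = swap(x, 0x02, bfe(d, 2) ^ bfe(d, 1));
    x = swap(x, 0x01, bfe(d, 2) ^ bfe(d, 0));
    x = swap(x, 0x04, bfe(d, 3) ^ bfe(d, 2));
    x = swap(x, 0x02, bfe(d, 3) ^ bfe(d, 1));
    x = swap(x, 0x01, bfe(d, 3) ^ bfe(d, 0));
    x = swap(x, 0x08, bfe(d, 3));
    x = swap(x, 0x04, bfe(d, 2));
    x = swap(x, 0x02, bfe(d, 1));
    x = swap(x, 0x01, bfe(d, 0));
    return __shfl(x, 7, 16);              // broadcast lane 7 within each group
}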

If you want to optimize memory bandwidth, have each thread process 2 groups: read in a ushort2, then shuffle the two halves around so that values 0 and 16 are in thread 0, 1 and 17 in thread 1, …, 15 and 31 in thread 15. The other half of the warp then takes 32 and 48 for thread 16, 33 and 49 for thread 17, …, 47 and 63 for thread 31. While that sounds complicated, it's only a handful of instructions. With this you get maximum memory bandwidth with 128-byte reads (assuming the input array is properly aligned), use no shared memory, and need a relatively small number of registers. The cache isn't relevant for this since you never touch the same cache line twice.
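
A sketch of that redistribution (hypothetical names; assumes each lane has read one uint, i.e. a packed ushort2 with the lower-addressed value in the low halfword, and CC 3.0+ shuffles):

// Lane s holds values 2s and 2s+1 of 64 consecutive ushorts. Afterwards,
// within each 16-lane half-warp, lane d holds value d of the pair's low
// group in 'lo' and value 16+d of its high group in 'hi'.
__device__ void splitPairs(unsigned int u, int laneid,
                           unsigned short &lo, unsigned short &hi)
{
    int d = laneid & 15; // lane within the half-warp

    // value d of the low group lives in segment lane d/2; value 16+d of
    // the high group lives in segment lane 8 + d/2; d's parity picks the
    // halfword.
    unsigned int loPair = (unsigned int)__shfl((int)u,     (d >> 1), 16);
    unsigned int hiPair = (unsigned int)__shfl((int)u, 8 + (d >> 1), 16);

    lo = (unsigned short)((d & 1) ? (loPair >> 16) : (loPair & 0xFFFFu));
    hi = (unsigned short)((d & 1) ? (hiPair >> 16) : (hiPair & 0xFFFFu));
}

After this, the 16-wide network above runs once on lo and once on hi, sorting two groups per half-warp.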

mwilkinson:

That is cute with the full-warp ushort2. I like it. But excuse the lame question: I am not finding the bfe() API description. I know it stands for bit-field extract, but I cannot find it in the NVIDIA docs or elsewhere.

https://github.com/NVlabs/moderngpu/blob/master/include/device/intrinsics.cuh

The link from Vectorizer contains a bfe implementation. I use:

// extract num_bits bits of x, starting at bit position 'bit' (PTX bfe.u32)
__device__ __forceinline__ unsigned int
bfe(unsigned int x, unsigned int bit, unsigned int num_bits=1)
{
    unsigned int ret;
    asm("bfe.u32 %0, %1, %2, %3;" :
        "=r" (ret) : "r" (x), "r" (bit), "r" (num_bits));
    return ret;
}

I’ll chime in too. :)

@DoctorG, if you really want to find the median then there are “median networks”.

A 15 element median network is 47 compare-exchanges vs. 56 for a full sort. A median network leaves the median value at the middle register/wire/index.
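
To make "compare-exchange" concrete, a minimal in-thread sketch (the 47-CE 15-input median network itself is not reproduced here; the 4-input example is Batcher's well-known 5-CE sorting network, just to show the shape of the code):

// A compare-exchange: afterwards a holds the smaller value, b the larger.
__device__ __forceinline__ void cx(unsigned short &a, unsigned short &b)
{
    unsigned short lo = (a < b) ? a : b;
    unsigned short hi = (a < b) ? b : a;
    a = lo; b = hi;
}

// Example: 5-CE sorting network for 4 elements. A median/selection network
// is built the same way, with any CE that cannot influence the output
// position(s) of interest removed.
__device__ void sortingNetwork4(unsigned short v[4])
{
    cx(v[0], v[1]); cx(v[2], v[3]);
    cx(v[0], v[2]); cx(v[1], v[3]);
    cx(v[1], v[2]);
}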

Since you have so much data to work with, @tera’s suggested approach would let you run each thread in isolation. @mwilkinson’s approach is good too but you have so much data [(2048, 2560, 16)] and you’re dealing with such a small network that your problem is practically begging for an in-thread approach. :)

As far as coalesced loading, you could start out with a non-coalesced 16 byte per thread load (uint4) and unpack as needed. After that you might consider having two lanes cooperate and bounce the “missing” uint4 through shared.

The takeaway is that whether you use a sorting or median network it’s going to be crazy fast since you’re only dealing with a tiny number of elements and < 60 compare-exchanges.

Your optimization goal should be to become I/O bound.


Edit: @tera hints at it, but if the PTX vmin2/vmax2 ops are fast on your target GPU you could pull 4 uint4’s into each thread and run 2 median networks at once per thread. This would probably only be a win if you could arrange to have your 3D array layout already in a staggered format that would allow the kernel to simply load and go. Otherwise, the cost of rearranging and packing the data could be high.
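
For a taste of the paired idea, CUDA's per-halfword SIMD intrinsics __vminu2()/__vmaxu2() compare two packed 16-bit values at once, so one compare-exchange advances two networks in parallel (a sketch; on GPUs without native support these intrinsics are emulated, which is why it is only a win if they are fast on the target):

// Compare-exchange on packed pairs: each 32-bit register carries one
// element from each of two groups, so the same network runs twice at once.
__device__ __forceinline__ void cx2(unsigned int &a, unsigned int &b)
{
    unsigned int lo = __vminu2(a, b); // per-halfword unsigned minimum
    unsigned int hi = __vmaxu2(a, b); // per-halfword unsigned maximum
    a = lo; b = hi;
}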

@allanmac, could you please provide link(s) for median networks?
When I google it, all I see are median-joining networks.

@Vectorizer, median networks are really just “selection” networks – e.g. “give me the 8th out of 15 elements” or “give me the max out of N elements”. The latter is just a “bubble network” and can be done in one pass so you can see selection networks are going to be less effort than a full sort.

You could also imagine building a network that gave you the min, max and median… Pretty cool, eh?

Another (possibly sloppy) way of thinking about it is that you build a sorting network and then skip any compare-exchange operation that can no longer influence the selected position(s) of interest.

A quick Google doesn’t provide many references. Try “sorting and selection networks”.

Knuth, of course, mentions it in a paragraph or two but each sentence you parse grants you an additional PhD. :)

@allanmac I agree, this is purely data bound. I did implement the @tera approach, and I am getting 4ms on my GTX980 (Maxwell), which is my desktop dev system (as opposed to the Tesla K80, which is the target box).

I am only doing the reads as 4 consecutive 64-bit reads (unsigned long long), but I have the option to organize the array. So it is now structured as [4, 2560*2048] of unsigned long long. This way the warp-adjacent threads read adjacent memory in each cycle, and each of the FOUR loads strides by 10MB, but I am thinking that is OK given a block size of 256. There are lots of blocks in the grid (20K).

The odd thing is that this is no better than my old solution, which was also 4ms.

Brief background: we are getting 10MB frames of video as 5M unsigned shorts. Each frame is pushed onto the stack of 16, and I need the median pixel for a sequence of 15 frames (frames are pushed on the stack, and old frames fall off). I can build the stack however I want. I used to build it so that each frame was 10MB apart and do a single thread per pixel.

Ah… interesting!

Some thoughts:

  • A GTX 980 should be able to load ~900 MB in 4 ms (4/1000*224GB/sec).

    A napkin calculation says that loading 15 and storing 1 10 MB frame occupies (160MB / 224GB) of the GTX 980’s bandwidth which works out to < 1 ms.

    So, assuming the math is right, you’re already within a factor of ~5. That’s a good place to start.

  • A GTX 980 can have up to 1024 32-register resident warps (16 SMM * 64 warps) and a max of 512 resident blocks.

    Blocks that can’t fit have to wait.

    In your kernel, if each thread only puts 32 bytes of loads in flight (1024 bytes/warp) then the resident warps have to wait hundreds of cycles for 1MB to arrive before executing less than 200 instructions and writing out 64KB.

    If you have 1024 resident warps and want to load 224 GB/sec over 1 second, then each warp would need to issue ~1.835M 128-byte memory transactions. If 1 giga-request/sec can be issued then each warp would need to do so every 585 clocks.

    The implication is that if you don’t get enough requests “in flight” then you’ll never reach the device’s peak memory bandwidth.

    Perhaps you should try having each warp load/median/store 2/4/6/8 pixels at a time — whatever will maximize your 980 and K80 throughput.

  • If you're not already, you should try using const + __restrict__ or explicit __ldg() loads (see the sketch after this list).
  • nvprof --query-events --query-metrics --devices lists all the counters and stats you can capture.
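
On the const + __restrict__ / __ldg() point, a minimal sketch (hypothetical kernel; the qualifiers let the compiler route loads through the read-only data cache, and __ldg() forces it):

// Both qualifiers promise the compiler the input is read-only for the
// kernel's lifetime, enabling LDG (read-only cache) loads.
__global__ void medianKernel(const unsigned short * __restrict__ frames,
                             unsigned short * __restrict__ out)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned short v = __ldg(&frames[i]); // explicit read-only-cache load
    out[i] = v;                           // placeholder for the median work
}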

If the app is processing video, isn’t it going to be limited by the speed at which it can receive data across PCIe, at most about 12 GB/sec?

Definitely use the profiler to zero in on bottlenecks; it is much easier than trying to pinpoint this based on source code alone.

Since the hardware queues that track outstanding loads have finite depth, you would want each load to be as wide as possible to maximize the total amount of in-flight data tracked. The widest loads provided by the hardware are 16-byte loads, e.g. uint4.

As allanmac and njuffa point out, the key to maximizing memory throughput is to keep enough requests in flight.
I haven't checked with recent compilers, but historically the compiler had some weaknesses with the "multiple loads, then unpack data" sequence (it is nontrivial to optimize automatically, as the hardware has undocumented requirements around which registers accesses wider than 32 bits can use, even though the optimal pattern in highly regular cases like yours might be easy for a human to spot).
So you might want to fire up the SASS disassembler (e.g. cuobjdump --dump-sass) and check that the load instructions have been scheduled back-to-back, rather than interleaved with data-unpacking instructions, which would limit the requests in flight to one per warp.

The compiler has applied “load batching” for many years. However, this technique often drives up register pressure significantly, which in turn negatively impacts performance through lower occupancy. There is no perfect heuristic for this, but most complaints I have seen in the past were about too much batching, and thus too much register pressure. It wouldn’t hurt to check the disassembled code, of course.