I have an array of unsigned ints (16 bits each) stored in a 3D array of size (2048, 2560, 16). For each group of 16 along the minor axis I want the median. I have been doing this by sorting each chunk of 16 and taking the middle value.
Another way to say this: I want to sort 16 values, each 16 bits wide, and then grab the middle one.
Rather than use a block size of 16, I group the blocks as (32x16), launched as <<<32,16>>>, to maximize occupancy.
The data is staged in shared memory, and that is leading to what I believe is a shared memory bank conflict.
medianGroup = 32;
medianWarp = 16;
ushort[,] window = thread.AllocateShared<ushort>("window", medianGroup, medianWarp);
In truth, it is easier to take the median of 15 values, so only 15 of the 16 are used, and I only need to sort half the values to get the median. For this reason (and for contractual reasons) I am not using libraries such as Thrust, CUB, etc.
I also found that a simple selection sort worked, but it hits a nasty bank conflict when it does the swap.
The code snippet is below, with comments. I also tried to do the same thing with a warp shuffle operation, using a per-pass min reduction; that would help, but it had a bug that escaped me for a while.
The code is in C#, since it is more succinct that way, but I can provide the CUDA C as well.
selection sort:
// frames is the raw 3D array; window is the shared memory array that I will operate over
ushort[,] window = thread.AllocateShared<ushort>("window", medianGroup, medianWarp);
window[group, d] = frames[y, x, d]; // d = thread.x, group = thread.y (x & y come from block.x and block.y)
thread.SyncThreads();               // make every lane's store visible before lane 0 starts sorting
if (d == 0) // all threads in the group help with the load, but only one does the sort
{
    for (int j = 0; j < medianWarp / 2; j++) // only half the array needs sorting to reach the median
    {
        int min = j;
        for (int l = j + 1; l < frameStackSize; l++) // frameStackSize is 15
        {
            if (window[group, l] < window[group, min])
                min = l;
        }
        ushort temp = window[group, j];
        window[group, j]   = window[group, min];
        window[group, min] = temp;
    }
}
OK, I know you are groaning about the selection sort; so was I, but it was quick. So I tried to do the same thing with CUDA ShuffleDown, turning each pass into a warp-wide min reduction. This is a better solution since it can use all the threads in the warp, not just lane #0 (with the other lanes just loading).
// me is a ushort; d is the lane id in the warp (thread.x)
// the trick is to pad the warp with big values (65535) after the 15th value, so that there is enough to shuffle
for (int j = 0; j < medianWarp / 2; j++)
{
    me = ((d + j) < frameStackSize) ? window[group, d + j] : (ushort)65535;
    int idx = d + j; // carry the source position along with the value
    for (uint i = 1; i < medianWarp; i *= 2)
    {
        n = (ushort)thread.ShuffleDown(me, i, medianWarp);
        int nIdx = thread.ShuffleDown(idx, i, medianWarp);
        if (n < me) { me = n; idx = nIdx; }
    }
    if (d == 0) // lane 0 (not lane j) ends the reduction holding the min of elements j..14, and idx says where it was
    {
        ushort temp = window[group, j];
        window[group, j] = me;     // this write is also a bank conflict, but I could use a different array to avoid that
        window[group, idx] = temp; // swap the min out of the tail, otherwise it is found again on the next pass
    }
}