Unfortunately, it’s not open-source but I figured it would be nice to show some plots.
The NVIDIA ModernGPU (merge) and CUB (radix) implementations are state of the art but HotSort dominates below ~2-4m 64-bit keys.
I’ve put it off for years, but at some point I’ll replace the higher level merging kernels with something better and the performance curve should move higher and flatten out when sorting 10’s of millions of keys.
It’s funny how just a few years ago there was serious debate over whether GPUs could efficiently beat CPU sorting algorithms. :)
Not really that great of a paper, but if I am interpreting their charts correctly your HotSort is far superior to the listed implementations.
I have always found that thrust::sort(), which I believe is a radix sort(for primitives) is very fast. How does HotSort compare to thrust::sort() for a large array(2^29) of 32 bit floats?
pretty simple radix sort implementation [1] sorts 32-bit keys at 90 MKeys/sec [2] on the single core of i7-4770. With a few more optimizations it should be 110 MKeys/sec. So, the 18-core monster should process ~1 GKeys/sec
not sure what exactly you mean. radix sort may be preferred to other sorts in cases where it’s faster. 32-64 bit key is large enough for many usecases and additional 32-64 bit data can carry index of corresponding array element, so sorting will produce a sort of DB table index.
my personal usecases are Shindler Transform and LZ search. In both cases, we have 3-8 byte key, and 2-4 byte data. For LZ search those data bytes is just index in the dictionary, so sorting allows us to cluster together dictionary elements with the same first few bytes.
By “use case” I meant actual production code that includes this sort of radix sort (using 32-bit or 64-bit keys). I have 25 years of software optimization experience, and have worked with customers from diverse industries, yet have never come across any software that needed this. So I am curious where this would show up. Business analytics?
I will have to look up the Shindler Transform as I have never come across the term.
[Later:] I guess that should be “Schindler Transform”. Used in compression software from what I can tell from a quick internet search.
The use case @BulatZ describes above sounds similar to how I use HotSort.
Packing interesting attributes into a 64-bit key (or larger) and sorting them into contiguous buckets/bins/spans for consumption by downstream kernels is a very performant and clean idiom.
One observation I’ve made is that if you can somehow control your sorting stage so that it remains within its performance “sweet spot” then the cost of sorting becomes negligible when compared to other kernel load balancing schemes.
For example, sorting 16K 64-bit keys can be completed on a single Maxwell multiprocessor in 83 usecs.
Only one gmem load and one gmem store per key. It’s nearly free. :)
I have never worked with databases, so that may explain my lack of exposure to fast sorting outside the classroom.
I have worked on the implementation of an OpenGL pipeline. I do not recall any sorting in that, though. Well, I guess there is implicit sorting by means of the Z-buffer, but that is specialized hardware, no software required. What kind of graphics operations benefit from fast sorting? Tile-based rendering or ray-tracing maybe?