SRTS Radix Sort: High Performance and Scalable GPU Radix Sorting
  1 / 4    
Hi all,

We'd like to share with everyone our radix sorting implementation, [url="http://code.google.com/p/back40computing/wiki/RadixSorting"]SRTS Radix Sort[/url]. You can find the code, information, and performance results for a variety of GPUs and input types/sizes on our Google Code project site, [url="http://code.google.com/p/back40computing/"]Back40Computing[/url].

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our GPU sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the Giga-keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementations, and we operate on keys of any C/C++ numeric type. Satellite values are optional, and can be any arbitrary payload structure (within reason).

The radix sorting method currently seems to be the most amenable to parallelization on high-bandwidth architectures, and the fastest sorting implementations for numeric keys on modern processor architectures are radix sorting designs. In the classic apples-to-oranges comparison, the state-of-the-art sorting rates are (at the moment):
[list]
[*]Intel radix sort on quad-core Core i7: [b]240M 32-bit keys / second[/b] ([url="http://portal.acm.org/citation.cfm?doid=1807167.1807207"]paper[/url])
[*]Intel radix sort on 32-core Knight's Ferry MIC (Larrabee successor): [b]560M 32-bit keys / sec[/b] ([url="http://techresearch.intel.com/userfiles/en-us/FASTsort_CPUsGPUs_IntelMICarchitectures.pdf"]paper[/url])
[*]Our SRTS radix sort on GTX480: [b][color="#000080"]1,005M 32-bit keys/sec[/color][/b]
[/list]
And, unlike the former two, you can walk in off the street, buy a GTX480, download our source code, and evaluate the results yourself.

In addition to performance, one of our design goals for this project is flexibility. We've designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs, and for a wide variety of input types. See the [url="http://code.google.com/p/back40computing/wiki/RadixSorting"]project site[/url] for a (growing) compilation of performance results on a variety of devices and input types.

Although it's slightly outdated since our refactoring to accommodate the Fermi architecture, we request that if you use/reference/benchmark this code, please cite our [url="http://www.cs.virginia.edu/~dgm4d/papers/RadixSortTR.pdf"]Technical Report[/url]:

[code]
@TechReport{ Merrill:Sorting:2010,
author = "Duane Merrill and Andrew Grimshaw",
title = "Revisiting Sorting for GPGPU Stream Architectures",
year = "2010",
institution = "University of Virginia, Department of Computer Science",
address = "Charlottesville, VA, USA",
number = "CS2010-03"
}[/code]

Cheers!

Duane

#1
Posted 07/26/2010 05:28 AM   
Awesome work Duane, keep it up!

I read an older version of your paper and really liked the general philosophy of identifying memory bandwidth as the bottleneck, merging stages of the algorithm to reduce memory traffic, actually measuring the theoretical and achieved memory and compute throughput, and trying to tune the algorithm one way or another to hit both of the limits at the same time. Most of the programming that I do stops when I hit one or the other. It is great that you have shown how you can take it an extra step.

For this reason, I think that this paper is a good read even for people who care about writing high-performance CUDA programs but don't give a damn about sorting.

I'm curious to hear what you are planning to do next.

[quote]Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementation...[/quote]

If you have some free time and want to get people to use your code, try checking out thrust, replacing the current implementation with yours, and submitting it to nathan and jared. The thrust license seems compatible with your project's and I would use your code tomorrow if I could just re-sync thrust and get a speedup.

Also, I saw that you have a poster at PACT this year, drop me a line at if you end up going.

#2
Posted 07/26/2010 09:19 AM   
Wow... congratulations!

One comment, I was about to ask why the C2050 (no ECC) is ~33% slower than the GTX 480 but after double-checking the product specs it's clear why: C2050:448@1150MHz vs. GTX480:480@1400MHz = 77% of a 480. That's some great scaling!

#3
Posted 07/26/2010 12:46 PM   
Any plans for MultiGPU and optimisation to the GTX 460?

#4
Posted 07/28/2010 08:20 AM   
Great work !

I just downloaded and compiled your code on my linux box (Ubuntu 10.04 and Cuda 3.1). In 64-bit mode everything runs fine, but in 32-bit mode the program fails to launch with the error: "Cuda driver version is insufficient for Cuda runtime version"?

#5
Posted 07/28/2010 04:47 PM   
[quote name='moozoo' post='1094788' date='Jul 28 2010, 04:20 AM']Any plans for MultiGPU and optimisation to the GTX 460?[/quote]

I have some ideas for how to adapt it for both (1) multi-gpu and (2) keys larger than shiftable machine words (e.g., the 10B key format for the Jim Gray [url="http://sortbenchmark.org/"]sortbenchmark.org[/url] database sorting challenges). These features are not specific research objectives on my dissertation roadmap, but I think they're interesting and have real-world impact. So... hopefully. But if anyone else wants to take this code and run with it in those directions, please, do so!

As for the GTX460, GF104 has some interesting scheduling properties that differentiate it from GF100. I won't know if there's a sweeter "sweet spot" that can be selected for it in terms of active worker warps or grid sizes until we actually get one in our lab. (Which will hopefully be very soon.) But I suspect that performance using the current GF100 config will be quite reasonable.

Duane

#6
Posted 07/28/2010 05:42 PM   
[quote name='alexish' post='1095012' date='Jul 28 2010, 12:47 PM']Great work !

I just downloaded and compiled your code on my linux box (Ubuntu 10.04 and Cuda 3.1). In 64 bits everything run fine, but in 32 bits the program failed to launch with the error : "Cuda driver version is insufficient for Cuda runtime version" ?[/quote]


I have this same problem on one of the 64-bit machines in our lab (but strangely enough, not the others), and I have not yet gotten to the bottom of it.

In fact, I get the same error on this machine when compiling the SDK examples using "make i386=1" (which is the option for compiling the SDK for 32-bit machine/device code on 64-bit machines). Compiling this way is no problem for our other Ubuntu machines. The only difference that I can discern between our "good" machines for 32-bit compilation and this "bad" one is that the "bad" one only has gcc 4.2.3 (instead of 4.2.4). IDK; I'll know more soon.

We'll have to resolve this issue ourselves sometime next week, so I'll keep you posted.

Duane

#7
Posted 07/28/2010 05:57 PM   
[quote name='Gregory Diamos' post='1093661' date='Jul 26 2010, 05:19 AM']If you have some free time and want to get people to use your code, try checking out thrust, replacing the current implementation with yours, and submitting it to nathan and jared. The thrust license seems compatible with your project's and I would use your code tomorrow if I could just re-sync thrust and get a speedup.[/quote]

I believe Nathan is actively working on this. I like the Thrust model of "only generate kernels that your application needs" for a variety of reasons, so hopefully they will find it to be a good match.

[quote name='Gregory Diamos' post='1093661' date='Jul 26 2010, 05:19 AM']Also, I saw that you have a poster at PACT this year, drop me a line at if you end up going.[/quote]

I am going, and would love to chat.

Duane

#8
Posted 07/28/2010 06:02 PM   
[quote]In fact, I get the same error on this machine when compiling the SDK examples using "make i386=1" (which is the option for compiling the SDK for 32-bit machine/device code on 64-bit machines). Compiling this way is no problem for our other Ubuntu machines. The only difference that I can discern between our "good" machines for 32-bit compilation and this "bad" one is that the "bad" one only has gcc 4.2.3 (instead of 4.2.4). IDK; I'll will know more soon.[/quote]

Indeed it's very strange ... I work with a newly installed Ubuntu 10.04 with gcc 4.4.3.

I can run the SDK radix sort compiled in 64-bit mode on all my graphics cards (GTX 480, GTX 280 and 9500GT).
Here are the results of the benchmarks I just ran for 1M elements:
GTX480 ---- SDK 0.253 10^9 elts/s ---- SRTS 0.53 10^9 elts/s
9500GT ---- SDK 0.015 10^9 elts/s ---- SRTS 0.032 10^9 elts/s

But I have some problems with the SRTS on my GTX280:
First try:
[quote]Using device 1: GeForce GTX 280
key-value, 1 iterations, 1048576 elements, 6.483040 GPU ms, 0.161741 x10^9 elts/sec
Incorrect: [4]: 0 < 8

[...0, 0, 0, 8, 0, 0, 0, 0, 28, ...][/quote]

Successive tests with 1M elts:
[quote]Using device 1: GeForce GTX 280
key-value, 1 iterations, 1048576 elements Cuda error in file 'test_radix_sort_simple.cu' in line 183 : unspecified launch failure.[/quote]

If I reduce the number of elements to 100000 I get incorrect sorts each time:
[quote]key-value, 1 iterations, 104857 elements, 0.936128 GPU ms, 0.112011 x10^9 elts/sec
Incorrect: [3]: 6 < 131072

[...0, 131072, 131072, 6, 0, 0, 0, 0, ...][/quote]

Update:
The problem I encounter with the GTX 280 is related to the use of the gencode option. Any attempt to use this option, even with only the 1.3 architecture, results in the problem described before. If I don't use gencode at all, the code runs on the GTX280 and sorts 1M at 0.2 x10^9 elts/s (compared to 0.115 x10^9 elts/s for the SDK sort).

As for the 32-bit problem, I have no clue.....

#9
Posted 07/29/2010 11:26 AM   
Alexish,

[quote name='alexish' post='1095547' date='Jul 29 2010, 07:26 AM']the problem i encounter with the GTX 280 is related to the use of the gencode option. Any attempt to use this option even with only the 1.3 architecture results in the problem described before. If i don't use gencode at all the code run on the GTX280 and sort 1M at 0.2 10^9 elts/s (respect to 0.115 10^9 elts/s for the SDK sort)[/quote]

When you compile without the gencode specified, what SM target does ptxas report for you? In general, you'll want to use the gencode option for 1.3 -- or at least make sure from the verbose ptxas output that your kernels are generated for SM1.3. (The code is compiled very differently for SM1.0/1.1, and will be much much slower on GT200.)

As for your problem with SM1.3 specified: normally I would assume there to be a correctness issue with the sorting routines, but (1) it works on all of our GT200 architectures, e.g.:

[code]: ~/B40CSrtsRadixSort>; ./bin/test_radix_sort_simple_3.0 --n=1048576 --i=100
Using device 0: GeForce GTX 280
key-value, 100 iterations, 1048576 elements, 3.754069 GPU ms, 0.279317 x10^9 elts/sec
Correct[/code]

and (2), I've actually experienced correctness issues with GT200 architectures (GTX280, 285) in terms of the hardware not performing st operations correctly. See [url="http://forums.nvidia.com/index.php?showtopic=107786&hl="]http://forums.nvidia.com/index.php?showtopic=107786&hl=[/url]. The hardware can get "into a bad state" that seems to only be fixable by powering down and restarting (not rebooting). This has actually been a huge problem for me for those cards, so I wrote a simple "mem-copy" program that will report whether or not the card is in a borked state. (See the code in that thread.) I suggest that you try running this little "borked" kernel for both 128B and 64B memory txns and see if you're borked.

Duane

#10
Posted 07/29/2010 04:08 PM   
Duane,

[quote name='dgm4d' post='1095667' date='Jul 29 2010, 06:08 PM']When you compile without the gencode specified, what SM target does ptxas report for you? In general, you'll want to use the gencode option for 1.3 -- or at least make sure from the verbose ptxas output that your kernels are generated for SM1.3. (The code is compiled very differently for SM1.0/1.1, and will be much much slower on GT200.)[/quote]
Without any option, the default target is sm1.0. Actually, sm1.0 and 1.1 work, but any compute capability greater than 1.1 ends in a correctness issue.

[quote]and (2), I've actually experienced correctness issues with GT200 architectures (GTX280, 285) in terms of the hardware not performing st operations correctly. See [url="http://forums.nvidia.com/index.php?showtopic=107786&hl="]http://forums.nvidia.com/index.php?showtopic=107786&hl=[/url]. The hardware can get "into a bad state" that seems to only be fixable by powering down and restarting (not rebooting). This has actually been a huge problem for me for those cards, so I wrote a simple "mem-copy" program that will report whether or not the card is in a borked state. (See the code in that thread.) I suggest that you try running this little "borked" kernel for both 128B and 64B memory txns and see if you're borked.[/quote]
I tried the code; I am not borked. I experienced a correctness problem too, but with two Tesla C1060s: my code worked perfectly on my 280s (the same ones I am using now) but had some random memory access problems with the Teslas. The Teslas started to crash with some SDK 3.0 examples (strangely, not before version 3.0), and at that point I suspected a hardware problem. PNY changed my two cards and now everything is OK, so it was a hardware problem.

Alexis.

#11
Posted 07/29/2010 04:56 PM   
[quote name='alexish' post='1095688' date='Jul 29 2010, 12:56 PM']I tried the code, i am not borked. I experienced a correctness problem too but with two tesla C1060, my code worked perfectly on my 280s (the same i am using now) but had some random memory access problems with the teslas. The telsas started to crash with some SDK 3.0 examples (strangly not before version 3.0) and at this point i suspected an hardware problem. PNY change my two cards and now everything it's ok, so it was an hardware problem.[/quote]

Yeah, that bork-tester was written to diagnose a problem with streaming 128B stores, and SRTS radix sorting uses more random 32B scatters.

Can you (1) see if SM1.3 gen'd code runs okay on one of your 1060s, and (2) provide me with a zipped sample list of uints that causes incorrect results on your 280 (i.e., modify the simple test code to print out keys before sorting)?

Duane

#12
Posted 07/29/2010 06:31 PM   
Btw, the big difference between SM1.0 and 1.3 gen'd code is that the former has different logic for making aligned scatters, whereas the latter just lets the hardware do the work. Otherwise it's ostensibly the same code.

Duane

#13
Posted 07/29/2010 06:43 PM   
I just updated the sources with a new revision (1.0.181). The changes include:

[list]
[*][b]Early-exit digit-passes[/b]. This is a huge feature for many sorting problems. When the sorting operation detects that all keys have the same digit at the same digit-place, the pass for that digit-place is short-circuited, reducing the cost of that pass by 80%. This makes our implementation suitable for even low-degree binning problems (where sorting would normally be overkill).

[*]Refactorization to improve usability

[*]Workaround for a nasty Cuda 3.1 x64 SM1.3 pragma-unroll code generation bug that was causing the incorrect results that Alexish was reporting for this particular configuration.
[/list]
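The early-exit check itself is cheap, since the counting phase already builds a digit histogram for each pass: if a single bin holds every key, the scan and scatter for that digit place can be skipped. A serial sketch of the test (illustrative only, assuming 8-bit digits; the actual kernels fold this into the reduction):

```cpp
#include <cstdint>
#include <cstddef>

// Returns true when every key has the same digit at the given digit
// place, i.e. the whole pass for that digit place can be skipped.
bool pass_can_short_circuit(const uint32_t* keys, size_t n, int shift) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; ++i)               // histogram (already needed
        ++count[(keys[i] >> shift) & 0xFF];      // by the counting phase)
    for (int d = 0; d < 256; ++d)
        if (count[d] == n) return true;          // all keys fall in one bin
    return false;
}
```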

-Duane

#14
Posted 08/10/2010 06:17 AM   