SRTS Radix Sort: High Performance and Scalable GPU Radix Sorting
  1 / 4    
Hi all,

We'd like to share with everyone our radix sorting implementation, [url="http://code.google.com/p/back40computing/wiki/RadixSorting"]SRTS Radix Sort[/url]. You can find the code, information, and performance results for a variety of GPUs and input types/sizes on our Google Code project site, [url="http://code.google.com/p/back40computing/"]Back40Computing[/url].

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our GPU sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the Giga-keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementations, and we operate on keys of any C/C++ numeric type. Satellite values are optional, and can be any arbitrary payload structure (within reason).

The radix sorting method currently seems to be the most amenable to parallelization on high-bandwidth architectures, and the fastest sorting implementations for numeric keys on modern processor architectures are radix sorting designs. In the classic apples-to-oranges comparison, the state-of-the-art sorting rates are (at the moment):
[list]
[*]Intel radix sort on quad-core Core i7: [b]240M 32-bit keys / second[/b] ([url="http://portal.acm.org/citation.cfm?doid=1807167.1807207"]paper[/url])
[*]Intel radix sort on 32-core Knight's Ferry MIC (Larrabee successor): [b]560M 32-bit keys / sec[/b] ([url="http://techresearch.intel.com/userfiles/en-us/FASTsort_CPUsGPUs_IntelMICarchitectures.pdf"]paper[/url])
[*]Our SRTS radix sort on GTX480: [b][color="#000080"]1,005M 32-bit keys/sec[/color][/b]
[/list]
And, unlike the former two, you can walk in off the street, buy a GTX480, download our source code, and evaluate the results yourself.

In addition to performance, one of our design goals for this project is flexibility. We've designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs, and for a wide variety of input types. See the [url="http://code.google.com/p/back40computing/wiki/RadixSorting"]project site[/url] for a (growing) compilation of performance results on a variety of devices and input types.

Although it's slightly outdated since our refactoring to accommodate the Fermi architecture, we request that if you use/reference/benchmark this code, please cite our [url="http://www.cs.virginia.edu/~dgm4d/papers/RadixSortTR.pdf"]Technical Report[/url]:

[code]
@TechReport{ Merrill:Sorting:2010,
author = "Duane Merrill and Andrew Grimshaw",
title = "Revisiting Sorting for GPGPU Stream Architectures",
year = "2010",
institution = "University of Virginia, Department of Computer Science",
address = "Charlottesville, VA, USA",
number = "CS2010-03"
}[/code]

Cheers!

Duane

#1
Posted 07/26/2010 05:28 AM   
Awesome work Duane, keep it up!

I read an older version of your paper and really liked the general philosophy of identifying memory bandwidth as the bottleneck, merging stages of the algorithm to reduce memory traffic, actually measuring the theoretical and achieved memory and compute throughput, and trying to tune the algorithm one way or another to hit both of the limits at the same time. Most of the programming that I do stops when I hit one or the other. It is great that you have shown how you can take it an extra step.

For this reason, I think that this paper is a good read even for people who care about writing high-performance CUDA programs but don't give a damn about sorting.

I'm curious to hear what you are planning to do next.

[quote]Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementation...[/quote]

If you have some free time and want to get people to use your code, try checking out thrust, replacing the current implementation with yours, and submitting it to nathan and jared. The thrust license seems compatible with your project's and I would use your code tomorrow if I could just re-sync thrust and get a speedup.

Also, I saw that you have a poster at PACT this year, drop me a line at if you end up going.

#2
Posted 07/26/2010 09:19 AM   
Wow... congratulations!

One comment, I was about to ask why the C2050 (no ECC) is ~33% slower than the GTX 480 but after double-checking the product specs it's clear why: C2050:448@1150MHz vs. GTX480:480@1400MHz = 77% of a 480. That's some great scaling!

#3
Posted 07/26/2010 12:46 PM   
Any plans for MultiGPU and optimisation to the GTX 460?

#4
Posted 07/28/2010 08:20 AM   
Great work !

I just downloaded and compiled your code on my linux box (Ubuntu 10.04 and Cuda 3.1). In 64-bit mode everything runs fine, but in 32-bit mode the program fails to launch with the error: "Cuda driver version is insufficient for Cuda runtime version"?

#5
Posted 07/28/2010 04:47 PM   
[quote name='moozoo' post='1094788' date='Jul 28 2010, 04:20 AM']Any plans for MultiGPU and optimisation to the GTX 460?[/quote]

I have some ideas for how to adapt it for both (1) multi-gpu and (2) keys larger than shiftable machine words (e.g., the 10B key format for the Jim Gray [url="http://sortbenchmark.org/"]sortbenchmark.org[/url] database sorting challenges). These features are not specific research objectives on my dissertation roadmap, but I think they're interesting and have real-world impact. So... hopefully. But if anyone else wants to take this code and run with it in those directions, please, do so!

As for the GTX460, GF104 has some interesting scheduling properties that differentiate it from GF100. I won't know if there's a sweeter "sweet spot" that can be selected for it in terms of active worker warps or grid sizes until we actually get one in our lab. (Which will hopefully be very soon.) But I suspect that performance using the current GF100 config will be quite reasonable.

Duane

#6
Posted 07/28/2010 05:42 PM   
[quote name='alexish' post='1095012' date='Jul 28 2010, 12:47 PM']Great work !

I just downloaded and compiled your code on my linux box (Ubuntu 10.04 and Cuda 3.1). In 64 bits everything run fine, but in 32 bits the program failed to launch with the error : "Cuda driver version is insufficient for Cuda runtime version" ?[/quote]


I have this same problem on one of the 64-bit machines in our lab (but strangely enough, not the others), and I have not yet gotten to the bottom of it.

In fact, I get the same error on this machine when compiling the SDK examples using "make i386=1" (which is the option for compiling the SDK for 32-bit machine/device code on 64-bit machines). Compiling this way is no problem for our other Ubuntu machines. The only difference that I can discern between our "good" machines for 32-bit compilation and this "bad" one is that the "bad" one only has gcc 4.2.3 (instead of 4.2.4). IDK; I'll know more soon.

We'll have to resolve this issue ourselves sometime next week, so I'll keep you posted.

Duane

#7
Posted 07/28/2010 05:57 PM   
[quote name='Gregory Diamos' post='1093661' date='Jul 26 2010, 05:19 AM']If you have some free time and want to get people to use your code, try checking out thrust, replacing the current implementation with yours, and submitting it to nathan and jared. The thrust license seems compatible with your project's and I would use your code tomorrow if I could just re-sync thrust and get a speedup.[/quote]

I believe Nathan is actively working on this. I like the Thrust model of "only generate kernels that your application needs" for a variety of reasons, so hopefully they will find it to be a good match.

[quote name='Gregory Diamos' post='1093661' date='Jul 26 2010, 05:19 AM']Also, I saw that you have a poster at PACT this year, drop me a line at if you end up going.[/quote]

I am going, and would love to chat.

Duane

#8
Posted 07/28/2010 06:02 PM   
[quote]In fact, I get the same error on this machine when compiling the SDK examples using "make i386=1" (which is the option for compiling the SDK for 32-bit machine/device code on 64-bit machines). Compiling this way is no problem for our other Ubuntu machines. The only difference that I can discern between our "good" machines for 32-bit compilation and this "bad" one is that the "bad" one only has gcc 4.2.3 (instead of 4.2.4). IDK; I'll will know more soon.[/quote]

Indeed it's very strange ... I work with a newly installed Ubuntu 10.04 with gcc 4.4.3.

I can run the SDK radix sort compiled in 64-bit mode on all my graphics cards (GTX 480, GTX 280 and 9500GT).
Here are the results of the benchmarks I just ran for 1M elements:
GTX480 ---- SDK 0.253 10^9 elts/s ---- SRTS 0.53 10^9 elts/s
9500GT ---- SDK 0.015 10^9 elts/s ---- SRTS 0.032 10^9 elts/s

But I have some problems with the SRTS on my GTX280:
First try:
[quote]Using device 1: GeForce GTX 280
key-value, 1 iterations, 1048576 elements, 6.483040 GPU ms, 0.161741 x10^9 elts/sec
Incorrect: [4]: 0 < 8

[...0, 0, 0, 8, 0, 0, 0, 0, 28, ...][/quote]

Successive tests with 1M elts:
[quote]Using device 1: GeForce GTX 280
key-value, 1 iterations, 1048576 elements Cuda error in file 'test_radix_sort_simple.cu' in line 183 : unspecified launch failure.[/quote]

If I reduce the number of elements to 100000 I get incorrect sorts each time:
[quote]key-value, 1 iterations, 104857 elements, 0.936128 GPU ms, 0.112011 x10^9 elts/sec
Incorrect: [3]: 6 < 131072

[...0, 131072, 131072, 6, 0, 0, 0, 0, ...][/quote]

Update:
The problem I encounter with the GTX 280 is related to the use of the gencode option. Any attempt to use this option, even with only the 1.3 architecture, results in the problem described before. If I don't use gencode at all, the code runs on the GTX280 and sorts 1M at 0.2 x10^9 elts/s (compared to 0.115 x10^9 elts/s for the SDK sort).

As for the 32-bit problem, I have no clue.....

#9
Posted 07/29/2010 11:26 AM   
Alexish,

[quote name='alexish' post='1095547' date='Jul 29 2010, 07:26 AM']the problem i encounter with the GTX 280 is related to the use of the gencode option. Any attempt to use this option even with only the 1.3 architecture results in the problem described before. If i don't use gencode at all the code run on the GTX280 and sort 1M at 0.2 10^9 elts/s (respect to 0.115 10^9 elts/s for the SDK sort)[/quote]

When you compile without the gencode specified, what SM target does ptxas report for you? In general, you'll want to use the gencode option for 1.3 -- or at least make sure from the verbose ptxas output that your kernels are generated for SM1.3. (The code is compiled very differently for SM1.0/1.1, and will be much much slower on GT200.)

As for your problem with SM1.3 specified: normally I would assume there to be a correctness issue with the sorting routines, but (1) it works on all of our GT200 architectures, e.g.:

[code]: ~/B40CSrtsRadixSort>; ./bin/test_radix_sort_simple_3.0 --n=1048576 --i=100
Using device 0: GeForce GTX 280
key-value, 100 iterations, 1048576 elements, 3.754069 GPU ms, 0.279317 x10^9 elts/sec
Correct[/code]

and (2), I've actually experienced correctness issues with GT200 architectures (GTX280, 285) in terms of the hardware not performing st operations correctly. See [url="http://forums.nvidia.com/index.php?showtopic=107786&hl="]http://forums.nvidia.com/index.php?showtopic=107786&hl=[/url]. The hardware can get "into a bad state" that seems to only be fixable by powering down and restarting (not rebooting). This has actually been a huge problem for me for those cards, so I wrote a simple "mem-copy" program that will report whether or not the card is in a borked state. (See the code in that thread.) I suggest that you try running this little "borked" kernel for both 128B and 64B memory txns and see if you're borked.

Duane

#10
Posted 07/29/2010 04:08 PM   
Duane,

[quote name='dgm4d' post='1095667' date='Jul 29 2010, 06:08 PM']When you compile without the gencode specified, what SM target does ptxas report for you? In general, you'll want to use the gencode option for 1.3 -- or at least make sure from the verbose ptxas output that your kernels are generated for SM1.3. (The code is compiled very differently for SM1.0/1.1, and will be much much slower on GT200.)[/quote]
Without any option, the default target is sm1.0. Actually, sm1.0 and 1.1 work, but any compute capability greater than 1.1 ends in a correctness issue.

[quote]and (2), I've actually experienced correctness issues with GT200 architectures (GTX280, 285) in terms of the hardware not performing st operations correctly. See [url="http://forums.nvidia.com/index.php?showtopic=107786&hl="]http://forums.nvidia.com/index.php?showtopic=107786&hl=[/url]. The hardware can get "into a bad state" that seems to only be fixable by powering down and restarting (not rebooting). This has actually been a huge problem for me for those cards, so I wrote a simple "mem-copy" program that will report whether or not the card is in a borked state. (See the code in that thread.) I suggest that you try running this little "borked" kernel for both 128B and 64B memory txns and see if you're borked.[/quote]
I tried the code; I am not borked. I experienced a correctness problem too, but with two Tesla C1060s: my code worked perfectly on my 280s (the same ones I am using now) but had some random memory access problems with the Teslas. The Teslas started to crash with some SDK 3.0 examples (strangely, not before version 3.0), and at that point I suspected a hardware problem. PNY changed my two cards and now everything is OK, so it was a hardware problem.

Alexis.

#11
Posted 07/29/2010 04:56 PM   
[quote name='alexish' post='1095688' date='Jul 29 2010, 12:56 PM']I tried the code, i am not borked. I experienced a correctness problem too but with two tesla C1060, my code worked perfectly on my 280s (the same i am using now) but had some random memory access problems with the teslas. The telsas started to crash with some SDK 3.0 examples (strangly not before version 3.0) and at this point i suspected an hardware problem. PNY change my two cards and now everything it's ok, so it was an hardware problem.[/quote]

Yeah, that bork-tester was written to diagnose a problem with streaming 128B stores, and SRTS radix sorting uses more random 32B scatters.

Can you (1) see if SM1.3 gen'd code runs okay on one of your 1060s, and (2) provide me with a zipped sample list of uints that causes incorrect results on your 280 (i.e., modify the simple test code to print out keys before sorting)?

Duane

#12
Posted 07/29/2010 06:31 PM   
Btw, the big difference between SM1.0 and 1.3 gen'd code is that the former has different logic for making aligned scatters, whereas the latter just lets the hardware do the work. Otherwise it's ostensibly the same code.

Duane

#13
Posted 07/29/2010 06:43 PM   
I just updated the sources with a new revision (1.0.181). The changes include:

[list]
[*][b]Early-exit digit-passes[/b]. This is a huge feature for many sorting problems. When the sorting operation detects that all keys have the same digit at the same digit-place, the pass for that digit-place is short-circuited, reducing the cost of that pass by 80%. This makes our implementation suitable for even low-degree binning problems (where sorting would normally be overkill).

[*]Refactorization to improve usability

[*]Workaround for a nasty Cuda 3.1 x64 SM1.3 pragma-unroll code generation bug that was causing the incorrect results that Alexish was reporting for this particular configuration.
[/list]
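The early-exit check itself is cheap, since the counting phase already builds a digit histogram for each pass: if a single bin holds every key, the scan and scatter for that digit place can be skipped. A serial sketch of the test (illustrative only, assuming 8-bit digits; the actual kernels fold this into the reduction):

```cpp
#include <cstdint>
#include <cstddef>

// Returns true when every key has the same digit at the given digit
// place, i.e. the whole pass for that digit place can be skipped.
bool pass_can_short_circuit(const uint32_t* keys, size_t n, int shift) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; ++i)               // histogram (already needed
        ++count[(keys[i] >> shift) & 0xFF];      // by the counting phase)
    for (int d = 0; d < 256; ++d)
        if (count[d] == n) return true;          // all keys fall in one bin
    return false;
}
```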

-Duane

#14
Posted 08/10/2010 06:17 AM   