I understand that sorting is a fundamental primitive on GPUs, and that there are many different implementations, e.g., radix sort, merge sort, etc.
With so many different sorting algorithms, I am not sure which one gives the best performance. My question is: what is currently the fastest sorting algorithm on the GPU?
thrust::sort() using device pointers is very fast for large arrays of primitive types (it uses a radix sort). It is about 100x faster than single-threaded std::sort() on a 4.5 GHz CPU for an array of 2^27 32-bit floats.
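A minimal sketch of that usage (array size, random fill, and error handling are illustrative only):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdlib>
#include <vector>

int main(){
    const size_t n = 1 << 27;                       // 134,217,728 elements
    std::vector<float> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = rand() / float(RAND_MAX);

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Wrapping the raw device pointer makes Thrust dispatch to the device path
    // (radix sort for primitive key types).
    thrust::device_ptr<float> dp = thrust::device_pointer_cast(d);
    thrust::sort(dp, dp + n);
    cudaDeviceSynchronize();

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}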
Currently Thrust just incorporates the CUB radix sort and provides a simpler interface on top of it. So if you are mad for speed, you can use ModernGPU for merge sort and CUB for radix sort directly. Commercial libs may be faster - otherwise you would hardly buy them :)
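For reference, calling CUB's radix sort directly looks roughly like this; the two-phase temp-storage query is CUB's standard usage pattern (function name and setup here are just a sketch):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

void cub_sort_keys(float* d_keys_in, float* d_keys_out, int num_items)
{
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with d_temp_storage == nullptr only computes the required temp size.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call performs the actual out-of-place sort.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}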
I have found that the choice of GPU and the system configuration make quite a big difference in performance, at least when using the Thrust library.
For a test case of a randomly generated array of 268,435,456 elements (32-bit float), using CUDA 8 on Windows 7 x64;
All times include the memory copies from host to device (pinned), the calling of sort, and the copy back of the result from device to host.
In this case the total memory copy time is about 171 ms, so any time above that is the sorting time on the GPU.
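As a rough sketch of that measurement pattern, here is the same copy-sort-copy sequence timed with CUDA events instead of the Windows multimedia timer the actual test uses (the pinned buffer names are made up for illustration):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

float time_copy_sort_copy(float* d_arr, float* h_pinned_in, float* h_pinned_out, size_t n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_arr, h_pinned_in, n * sizeof(float), cudaMemcpyHostToDevice);
    thrust::device_ptr<float> dp = thrust::device_pointer_cast(d_arr);
    thrust::sort(dp, dp + n);
    cudaMemcpy(h_pinned_out, d_arr, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // total = HtoD copy + sort + DtoH copy
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

The host buffers would be page-locked, e.g. allocated with cudaMallocHost(), so the copies run at full PCIe speed.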
For a GTX Titan X using TCC driver;
Num Elements= 268435456
GPU timing: 0.286 seconds.
CPU timing: 26.385 seconds.
Error= 0
Max Difference= 0
So the total sorting time on the GTX Titan X using thrust::sort() with device pointers is (286-171)= 115 ms for the sort.
Now if I use the GTX 1080 which is connected to the display and uses the WDDM driver;
Num Elements= 268435456
GPU timing: 1.241 seconds.
CPU timing: 26.355 seconds.
Error= 0
Max Difference= 0
So the total sorting time on the GTX 1080 using thrust::sort() with device pointers is (1241-171)= 1070 ms for the sort.
So using the Titan X is about 9.3 times faster for a large sort than using the GTX 1080, which is disappointing but illustrates how the system configuration can make a huge difference.
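If anyone wants to check from code which driver model each card is running under, the device properties report it; a quick hypothetical check (cudaDeviceProp::tccDriver is 1 for TCC, 0 for WDDM on Windows):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, driver model: %s\n",
               dev, prop.name, prop.tccDriver ? "TCC" : "WDDM");
    }
    return 0;
}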
Just ran that code on my laptop using a GTX 980M (without any memory clock boost), and somehow the GTX 980M is much faster than the GTX 1080, even though both are connected to the display and use the WDDM driver;
Num Elements= 268435456
GPU timing: 0.483 seconds.
CPU timing: 33.751 seconds.
Error= 0
Max Difference= 0
So the GTX 980M is over 2x faster than the GTX 1080? Weird…
Sorting libs are not yet optimized for Pascal GPUs, so they may take the wrong computation paths; for example, treating every sm >= 5.3 device the same as sm 5.3 (a low-end mobile GPU). You may report that to the developers, and I'm sure they will make the 1080 faster than any old Titan. One problem is that Thrust just employs CUB code for its radix sort, and it may not be the latest version, so I suggest you install CUB and benchmark/bug-report it directly.
I thought that maybe the performance difference between the Pascal GTX 1080 and the Maxwell GTX Titan X for that large 1 GB array might be related to the memory bandwidth issue (the bug I filed which NVIDIA fixed, then "unfixed").
But even if I reduce the size of the array by half (staying well below the 1 GB level) there still is a huge difference in performance.
All recorded times include the host-to-device and device-to-host copies in addition to the device sorting time. For this 2^27-element size, the nvprof profiling results show a total memory copy overhead of ~84 ms for either GPU.
GTX Titan X;
Num Elements= 134217728
GPU timing: 0.143 seconds.
CPU timing: 12.817 seconds.
Error= 0
Max Difference= 0
GTX 1080;
Num Elements= 134217728
GPU timing: 0.595 seconds.
CPU timing: 12.786 seconds.
Error= 0
Max Difference= 0
Thrust does indeed seem to be calling cub radix sort under the hood.
If you look in detail at the profiling output, you can see that the launch configurations are different for the various steps of the sort. This supports Bulat's theory that maybe Thrust (or CUB) has not been optimized for Pascal.
Wait a second, I see that the issue is a large unexplained gap which makes up most of the performance difference for the GTX 1080.
Take a look at lines 10-11 in each profiling output. For the GTX 1080 there is a 460 ms gap between the first host-device copy and the start of the sort, while for the GTX Titan X the gap is only as long as the memory copy.
The GTX 980M also does not have this gap, so it seems unlikely this is a WDDM issue alone.
Ok, I updated the test to average over 8 iterations (host-device copy of original array, sort, device-host copy of result) and it does appear that the GTX 1080 does have some extra "initialization" time.
Changed to this;
const int num_iter=8;
cout<<"\nnumber of sorting iterations= "<<num_iter<<'\n';

wTimerRes = 0;
init = InitMMTimer(wTimerRes);
startTime = timeGetTime();

for(int i=0;i<num_iter;i++){
    // Copy the unsorted source array from pinned host memory to the device.
    err=cudaMemcpy(D_Arr,H_Arr,num_bytes,_HTD);
    if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

    // Sort in place on the device through a raw device pointer.
    thrust::device_ptr<float> D_p=thrust::device_pointer_cast(D_Arr);
    thrust::sort(D_p,D_p+num_elem);

    // Copy the sorted result back to the host.
    err=cudaMemcpy(H_gpu_result,D_Arr,num_bytes,_DTH);
    if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

    // Make sure all device work has finished before the next iteration.
    err=cudaDeviceSynchronize();
    if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}
}

endTime = timeGetTime();
GPU_time=endTime-startTime;
cout<<"GPU timing average: "<<(double(GPU_time)/1000.0)/double(num_iter)<<" seconds including all memory copies both directions.\n";
GTX Titan X;
Num Elements= 134217728
Using device number 0 which is a GeForce GTX TITAN X
number of sorting iterations= 8
GPU timing average: 0.143375 seconds including all memory copies both directions.
CPU timing: 12.798 seconds.
Error= 0
Max Difference= 0
GTX 1080;
Num Elements= 134217728
Using device number 1 which is a GeForce GTX 1080
number of sorting iterations= 8
GPU timing average: 0.254875 seconds including all memory copies both directions.
CPU timing: 12.817 seconds.
Error= 0
Max Difference= 0
Since the memory copies take about 84 ms each iteration, the GTX Titan X average sorting time for this size is (143-84)= 59 ms, while the GTX 1080's is (254-84)= 170 ms.
While the GTX 1080's average time is faster than the original times I posted, the GTX Titan X is still over 2x faster for the sort.
What is weird is that the GTX 980M does not have that initial 460 ms of overhead on the first call even though it is also connected to the display, but the GTX 1080 does consistently have that large overhead.
I do get that the average time for the GTX 1080 is being influenced by the "first" call; I just do not understand why the Maxwell GPUs do not have that overhead.
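One way to separate that one-time cost from the steady-state sort time would be a single untimed warm-up iteration before the timed loop; roughly, using the same (hypothetical) buffer names as the loop above:

#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

void warm_up_then_benchmark(float* D_Arr, const float* H_Arr,
                            size_t num_elem, size_t num_bytes)
{
    // Untimed warm-up: absorbs any first-call overhead (context/library setup)
    // so it does not get folded into the per-iteration average.
    cudaMemcpy(D_Arr, H_Arr, num_bytes, cudaMemcpyHostToDevice);
    thrust::device_ptr<float> D_p = thrust::device_pointer_cast(D_Arr);
    thrust::sort(D_p, D_p + num_elem);
    cudaDeviceSynchronize();

    // ...start the timer here and run the measured iterations as in the loop above...
}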