Modern GPU

Hello,
Does anyone know if MGPU scan code works on Titan X?

No reason why it would not.

That (MGPU) seems to be an old project, and based on your two posts I cannot help but wonder why you do not use the robust libraries included in the SDK (cuBLAS, cuSPARSE, Thrust, etc.)?

They probably have the functionality you need, are well tested, well documented, and very fast. Also, some of them offer multi-GPU support.

If one of your options is a library developed/supported by NVIDIA, you should examine that option first. Most of their libraries are excellent.

MGPU performs better than Thrust, that's why I'm using it. The whole project is a set of benchmarks, and it looks like they haven't included support for the Titan X. I just wanted to make sure whether that's the case…

Interesting. Is DE Shaw behind this project?

When you say MGPU performs better than Thrust, is that across the board or only for some algorithms?

I find it hard to believe that there is a faster sort for large numbers of primitives than thrust::sort() using device pointers (not their vector class).

But since I have not compared against MGPU, I suppose it is possible.
Jimmy Pettersson has the fastest sum reduction I have seen, and I wonder how that would do against MGPU. Might have to find out…

I'm guessing this is not DE Shaw since the author of MGPU now works there.

OK, I ran a quick test timing thrust::sort() on 2^28 random 32-bit float elements, including the host-device copy, the sort, and the device-host copy.
It is compared against the C++ std::sort(), which I believe is typically an introsort (a quicksort variant).

CUDA 6.5, Windows 7, GTX Titan X

Num Elements= 268435456
GPU timing: 0.285 seconds.
CPU timing: 26.257 seconds.

Error= 0

Max Difference= 0

CUDA 7.5, Windows 7, GTX Titan X

Num Elements= 268435456
GPU timing: 0.286 seconds.
CPU timing: 26.344 seconds.

Error= 0

Max Difference= 0

Note: I used pinned host memory so that the host-device and device-host copies could be performed at about 12 GB/s across the PCIe bus.
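
For reference, here is a minimal sketch of the kind of harness described above (pinned host buffer, host-to-device copy, thrust::sort() on raw device pointers, device-to-host copy). The exact benchmark code is not shown in the thread, so this is illustrative only:

// Sketch only: pinned host memory plus thrust::sort() on raw device pointers.
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t N = 1 << 28;                    // 2^28 floats, as in the test above
    float *h = 0, *d = 0;

    cudaHostAlloc((void**)&h, N * sizeof(float), cudaHostAllocDefault); // pinned host buffer
    cudaMalloc((void**)&d, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) h[i] = (float)rand() / RAND_MAX;     // random input

    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);        // HtoD at pinned-memory speed

    thrust::device_ptr<float> dp(d);             // wrap the raw pointer, no device_vector needed
    thrust::sort(dp, dp + N);

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);        // DtoH

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}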

The host-device transfer time according to nvprof:

18.9587s  88.593ms                    -               -         -         -         -  1.0000GB  11.288GB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]

after the sort the device-host transfer time according to nvprof:

19.1618s  82.352ms                    -               -         -         -         -  1.0000GB  12.143GB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

So the total memory transfer time is ~171 ms and the total overall time is ~286 ms, which means the actual GPU sort took ~115 ms.

Including the memory copies, which dominate the running time, thrust::sort() is about 92 times faster than std::sort() running on an overclocked 4.5 GHz i7 CPU.

If you do not include the memory transfer times, the performance difference reaches a whopping 228x over the overclocked 4.5 GHz CPU.

Also, I dug up some older code I wrote which compares three versions of a reduction that returns the minimum 64-bit double value in the array along with the index of that value.

I first benchmarked a standard serial CPU version,

then the following canonical grid-stride GPU reduction approach (a fuller sketch appears below);

for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < sz; i += blockDim.x * gridDim.x) { ... }

then an implementation of the approach used by Jimmy Pettersson.
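
For reference, a minimal sketch of that canonical grid-stride approach (the GPU0 variant) is below. It assumes 256 threads per block to match the output that follows, produces one partial result per block (a small second pass finishes the job), and the names are mine, not those of the benchmark code:

#include <cfloat>   // DBL_MAX

// One partial (min, index) pair per block; a second, smaller reduction over
// blockMin/blockArg yields the final answer.
__global__ void minIndexReduce(const double *in, int sz,
                               double *blockMin, int *blockArg)
{
    __shared__ double sVal[256];   // assumes blockDim.x == 256
    __shared__ int    sIdx[256];

    double best    = DBL_MAX;
    int    bestIdx = -1;

    // grid-stride loop: each thread scans a strided slice of the array
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < sz;
         i += blockDim.x * gridDim.x) {
        double v = in[i];
        if (v < best) { best = v; bestIdx = i; }
    }

    sVal[threadIdx.x] = best;
    sIdx[threadIdx.x] = bestIdx;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && sVal[threadIdx.x + s] < sVal[threadIdx.x]) {
            sVal[threadIdx.x] = sVal[threadIdx.x + s];
            sIdx[threadIdx.x] = sIdx[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        blockMin[blockIdx.x] = sVal[0];
        blockArg[blockIdx.x] = sIdx[0];
    }
}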

I think the bottom-most result is good enough to be compared to MGPU, Thrust, CUB, etc.:

Num Elements= 16777343

Starting CPU..

CPU time= 1019.77 , CPU ans= -987654 , index= 9997

CPU throughput= 13.1616 GB/s

Starting GPU0..

Num threads= 256 , total num blocks= 513

GPU time0= 53.8513 , GPU ans= -987654 , index= 9997

GPU throughput0= 249.24 GB/s

Starting GPU1..

Num threads= 256 , total num blocks= 513

GPU time2= 45.7111 , GPU ans2= -987654 , index2= 9997

GPU throughput1= 293.624 GB/s

The last output group is the JP approach, which I believe is the best approach.

Getting GPU throughput of 293.624 GB/s using 64-bit words for such a min reduction is close to the theoretical maximum global memory bandwidth.

How long does MGPU take to find the min value and corresponding index for an array of doubles that size?

A scan and a reduction are not the same thing, at least in the parlance that I am familiar with.

A scan is a transformation (prefix sum).
A reduction is … a reduction.
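
To make the distinction concrete, in Thrust terms (illustrative snippet only, not code from either project):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

int main()
{
    thrust::device_vector<int> in(4);
    in[0] = 3; in[1] = 1; in[2] = 4; in[3] = 1;

    // reduction: the whole array collapses to a single value
    int sum = thrust::reduce(in.begin(), in.end());             // 9
    (void)sum;

    // scan (prefix sum): the output has the same length as the input
    thrust::device_vector<int> out(4);
    thrust::inclusive_scan(in.begin(), in.end(), out.begin());  // 3 4 8 9

    return 0;
}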

I was always fuzzy on those definitions, so to be clear, I am talking about a reduction.
Thanks for the correction.

Prefix sums are a completely different animal, and the corresponding code is more complex than that in a basic reduction.

You could also try out the CUDPP library (GitHub - cudpp/cudpp: CUDA Data Parallel Primitives Library).
A while ago, its scan routines were faster than the ones from Thrust; I don't know if that's still true.
It's also a somewhat older project, but it still seems to be maintained and works fine with CUDA Toolkit 7.0.

I would like to see a test case for either scan or sorting where MGPU beats Thrust.
To be blunt, I find that statement very hard to believe.

There is also a newer framework - 'MAPS' - which looks nice.
See
http://www.cs.huji.ac.il/project/maps/
and

Not sure how easy it is to develop a 'scan' algorithm with this framework, but e.g. highly optimized 2-D convolution or matrix multiplication can be written in a few lines (as can be seen in the examples at MAPS: GPU Memory Abstraction and Optimization Framework).

I'd recommend being careful with these kinds of comparisons. It's not exactly sound to compare a vendor-optimized (non-portable) library implementation with the standard library version of the same algorithm. AFAIR, SIMD-optimized CPU sort implementations perform 2-5x faster than std::sort (e.g. http://www.vldb.org/pvldb/vol8/p1274-inoue.pdf). Use e.g. Intel IPP or ArrayFire's

If you look at ArrayFire's benchmarks (http://arrayfire.com/benchmarking-parallel-vector-libraries/) you can see that e.g. reduction in their tests is only ~2.6x faster on a K20 than on a lower-end 8-core IVB Xeon (though that chip has more bandwidth than your 4-core Haswell). Even if you factor in a likely 3x performance bump with the TITAN X, that's still far from your hundreds of x's.

Additionally, youā€™re comparing a 90 W TDP chip with a 250 W TDP beast.

Clearly it makes little sense to build up a straw man just so one can knock it down. I agree that performance comparisons between single-threaded scalar CPU codes and highly optimized GPU codes create harmful reality distortion, yet I still see papers that use such comparisons.

In the same vein, I would caution against comparing the power consumption of a CPU by itself with a complete subsystem that consists of a GPU (processor), GBs of high-speed memory, assorted auxiliary circuitry including a DC-DC converter, and a cooling fan. I do not have detailed data about the power draw of the non-GPU components on the Titan X, but I wouldn't be surprised if they sum to 80 watts out of the 250-watt total.

Sorts and reductions are not the best-suited algorithms to map to a GPU, yet they still vastly outperform any CPU-based library out there. For sorting, whether that difference is a factor of 100 or 25, it is still a massive difference.

When it comes to the problems GPUs are good at, such as image processing and brute-force search, the difference can be a factor of 100 or greater.

If you would like a very simple example, how about examining every distinct 3-index combination of an array (N choose 3) and evaluating each against a test function? For this problem there is a 400- to 500-times difference when compared to a fast single-threaded CPU implementation.

Even if you were able to write a multi-core CPU implementation at the same clock speed and cut that down by a factor of 4, there would still be a 100x performance difference.

https://github.com/OlegKonings/CUDA_brute_triangle
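
For a rough idea of the problem shape, here is an illustrative single-threaded CPU sketch (it is not the code from the repository above, and scoreFn is a hypothetical stand-in for the test function):

#include <vector>
#include <cfloat>

// Evaluate every distinct index triple (i, j, k), i < j < k, i.e. N choose 3
// combinations, and keep the best-scoring one.
double bestTriple(const std::vector<double>& a,
                  double (*scoreFn)(double, double, double),
                  int& bi, int& bj, int& bk)
{
    const int n = (int)a.size();
    double best = -DBL_MAX;
    bi = bj = bk = -1;

    for (int i = 0; i < n - 2; ++i)
        for (int j = i + 1; j < n - 1; ++j)
            for (int k = j + 1; k < n; ++k) {
                double s = scoreFn(a[i], a[j], a[k]);
                if (s > best) { best = s; bi = i; bj = j; bk = k; }
            }
    return best;
}

A GPU version can assign each thread one (or a batch of) combinations, which is what makes this kind of brute-force enumeration so GPU-friendly.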

I have yet to see anyone get close, but by all means prove me wrong.

At least I put my algorithm benchmarks out there with source code and list the hardware used.
If you think they are incorrect or you can do better, post your code and the hardware used.

As I said, the difference when you compare properly tuned codes is more like 0.5-3x, certainly not double digit. Please check the link I provided and google around a bit ;)

Sure, you can also cherry-pick cases where GPUs will be horrible, e.g. AES latency, small FFTs, many graph algorithms, …

First of all, that may be a simple example, but it's nonetheless cherry-picked ;)

You seem to forget that CPUs have SIMD units too, up to 256 bits wide these days. Plus, workstation and server CPUs have 6-16 cores (there's even an 18-core CPU from Intel), so that's a theoretical 32x and 128x improvement with 4 and 16 cores, respectively. Of course, this is a gross oversimplification that ignores many things like SMT/HT, boost clocks, multi-threaded scaling, etc., and most importantly the choice of algorithm, but it should still be enough of a warning that your 400-500x is questionable as a representative number for how fast your algorithm can run on GPUs vs. CPUs.

Having had a quick look at your reference CPU code, I have to say it makes no attempt at vectorization, nor is it written in a vectorizer-friendly way. Plus, this seems to be an embarrassingly parallel problem where you'll get linear scaling across cores (excluding frequency-scaling effects).

PS: Actually, I think early-termination conditions could speed up your algorithm quite a bit. Plus, a bit of clever sorting may do a lot. I may be wrong; I have not given it many brain cycles. ;)

While it is desirable to do apples-to-apples comparisons, by comparing "equally optimized" versions of CPU-based and GPU-based codes, I also have some ideas why many authors publishing on GPU-accelerated codes do not do so:

(1) Optimizing across the parallelization hierarchy in the CPU world (SIMD vectorization, multi-threading, e.g. OMP, multi-process, e.g. MPI) is hard, and in many cases much harder than parallelizing for the GPU. At least that is what I learned from my own efforts. Auto-vectorization throws in the towel at the drop of a hat, and vectorizing with SIMD intrinsics is about as much fun as a root canal (and yes, I have first-hand experience with both). OMP looks easy at first, but it is full of pitfalls; it is easy to shoot oneself in the foot. MPI is probably the easiest component to deal with.

(2) Yes, one can get multi-socket systems with high-core-count CPUs, but the cost of that is quite prohibitive even compared to the cost of an un-discounted Tesla GPU, let alone something like a Titan X. Maybe academia gets great discounts, but try to order such a system, quantity one, as an individual from any of the vendors specializing in those top-of-the-line systems. The matching software can then easily set you back another couple thousand dollars.

So in various instances researchers will have neither the time nor the money to invest in a top-of-the-line CPU-based solution to compare with their optimized GPU solution, achieved at very modest cost in hardware, software, and development time. I find it hard to find fault with that, but such restrictions should be pointed out in the papers. Of course, for published papers a line is crossed when the CPU version used as a baseline isn't even compiled with the best available compiler settings.

For sorting, I have seen no results from CPU-based solutions that can sort 2^30 or more floating-point values anywhere near as fast as thrust::sort().
Maybe for reductions or scans there is only a 3x difference with some fancy MIC parallel implementation, but not for sorting primitives.

If you can point me to an open-source parallel CPU library, I can test using a $600 Haswell i7 5930K 3.5 GHz 6-core, 12-thread CPU, which would make a good 'dollar-to-dollar' test against a GTX 980 or GTX 980 Ti, which are about the same price.

I do not believe this is true. I have actually worked with two of those examples and found GPU-based implementations outperform a single CPU thread by a good margin. Sure, a multi-threaded version may reduce the performance difference, but why bother when in CUDA I can get it going quickly on relatively cheap hardware?

As far as graph algorithms go, I have a few examples on my site which show that BFS-based graph algorithms, and those which can be mapped to a knapsack or Floyd-Warshall type structure, can be sped up by about a factor of 10 or more. Those particular problems are probably more difficult to map to a parallel CPU implementation than to a GPU-based implementation.

Two examples of DAG implementations in CUDA (a rough sketch of the Floyd-Warshall-type structure follows the links):

https://github.com/OlegKonings/CUDA_Dynamic_Programming_Example_6

https://github.com/OlegKonings/DP13_suka
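
As a rough illustration of the Floyd-Warshall-type structure mentioned above (a minimal, non-blocked sketch; it is not the code from either repository):

// One kernel launch per intermediate vertex k; dist is a dense n x n matrix.
// Assumes dist[v*n+v] == 0 and no negative cycles, so row k and column k do
// not change during iteration k, which keeps this simple formulation safe.
__global__ void fwStep(float *dist, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float viaK = dist[i * n + k] + dist[k * n + j];
        if (viaK < dist[i * n + j]) dist[i * n + j] = viaK;
    }
}

// Host side:
// for (int k = 0; k < n; ++k)
//     fwStep<<<grid, block>>>(d_dist, n, k);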

Vectorizing that particular quadruple-nested loop in CPU land will not be as easy as you suggest, but go ahead and show me how it is done. I would like to learn such techniques.

And yes, there are certainly heuristics which could be added to that code to reduce the running time, but the example was intended to illustrate the brute-force abilities of a GPU on a simple problem with a large search space.

Of course, pure "apples-to-apples" comparisons are difficult to make with different architectures. From my limited experience with multi-core CPU libraries and programming, I concluded that GPUs offer overall significantly better performance (for my problems) with good accuracy, reliability, and value.

Again, I wrote those simple code samples from scratch and am interested in anything which can compete in terms of performance. Saying it is possible does not mean as much as producing code which demonstrates comparable performance.

OpenMP parallelization is really easy to add for stuff like low-level image-processing code, but choosing the right number of CPU threads in the parallel region is not, as it depends on a number of factors. Furthermore, the most common OpenMP implementation on Windows is only OpenMP 2.0 and seems to have some serious performance-related issues; for example, it looks like the CPU threads for a parallel region are created anew instead of being kept in a thread pool.
Regarding SIMD intrinsics, I personally don't use them because they look like assembler and make the code unreadable to me. I rely on the compiler's auto-vectorization (which unfortunately seems to give up already for small conditional clauses).
My conclusion is that getting highly performant code on the CPU is also not a walk in the park.
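
For what it's worth, the "easy to add" part really is a one-liner for that kind of code; here is a minimal sketch (illustrative only, assuming a simple 8-bit grayscale buffer):

#include <vector>
#include <cstdint>

// Clamp-add a constant to every pixel; the pragma splits the loop across cores.
// OpenMP 2.0 (the MSVC default) requires a signed loop index, hence 'long'.
void brighten(std::vector<uint8_t>& img, int delta)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)img.size(); ++i) {
        int v = img[i] + delta;
        img[i] = (uint8_t)(v > 255 ? 255 : (v < 0 ? 0 : v));
    }
}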