What is the maximum speed-up that can be obtained with a GPU?

Hello, I have developed a solver on the GPU using CUDA. I have obtained a maximum speed-up of 35x for the GPU compared with the CPU. This speed-up varies with the size of the input data. I want to compare this speed-up with theoretical values. How do I calculate the theoretical speed-up for a GPU? I also want to know the maximum speed-up that is possible on a GPU. I have calculated speed-up as (time required for the sequential program / time required for the parallel program). I am using a GK208M [GeForce GT 740M] GPU with 2 SMs, 384 CUDA cores (192 on each SM), and a GPU clock rate of 1.03 GHz.

As long as you document your methodology and claims, I think you can use more or less any method to publish a speed-up number.

For codes that do the same work, I believe speed-up is simply the ratio of the execution times, as you have already indicated. In some situations you may only be interested in certain subsets of the actual code execution time; appropriate timing methodologies can address that.
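
For example, a minimal timing sketch along those lines might look like the following. The solver_cpu() and solver_kernel() names and the launch configuration are placeholders made up for illustration; CUDA events time the kernel on the GPU side, std::chrono times the host side, and the speed-up is just the ratio of the two.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two implementations being compared
// (assumptions for illustration, not the original solver).
void solver_cpu() { /* ... sequential solver work ... */ }
__global__ void solver_kernel() { /* ... parallel solver work ... */ }

int main()
{
    // CPU timing with a wall-clock timer.
    auto t0 = std::chrono::high_resolution_clock::now();
    solver_cpu();
    auto t1 = std::chrono::high_resolution_clock::now();
    float cpu_ms = std::chrono::duration<float, std::milli>(t1 - t0).count();

    // GPU timing with CUDA events bracketing the kernel.
    // Whether host<->device copies are included is a methodology
    // choice that should be documented alongside the result.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    solver_kernel<<<256, 256>>>();   // launch configuration is only an example
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("speed-up = %.1fx\n", cpu_ms / gpu_ms);
    return 0;
}
```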

Many codes are either compute bound or memory bound, and as a rough rule of thumb, the GPU offers about 5x higher capacity in either case. Therefore speed-ups significantly in excess of 5x are often viewed suspiciously or taken with a grain of salt. Furthermore, as a proposed “theoretical value” or predictor of performance, if you know that your code is mostly memory bound, or mostly compute bound, then you can compare the peak numbers for these between the two specific architectures, and use that as a “theoretical best case” or predictor of optimal performance. But see below about CPU multithreading; most modern CPUs cannot saturate their memory bus, or deliver their peak rated memory bandwidth, from code running on a single core. A similar statement can be made about compute-limited codes. GPUs also require certain characteristics in their codes to achieve optimal performance. As a simple example, kernel launches that use either 1 block or 1 thread per block cannot and will not achieve optimal performance by any measure.
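
As an illustration of that last point, a grid-stride loop (a common CUDA pattern, shown here on a made-up vector-scale kernel rather than any real solver) lets a launch use many threads per block and enough blocks to keep all SMs busy:

```cpp
#include <cuda_runtime.h>

// Hypothetical example kernel: scale an array in place.
__global__ void scale(float *x, float a, int n)
{
    // Grid-stride loop: each thread handles multiple elements,
    // and the launch can use enough blocks to occupy all SMs.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    // Bad for performance: scale<<<1, 1>>>(d_x, 2.0f, n);   // 1 block, 1 thread
    // Better: many threads per block, enough blocks to cover the whole array.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```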

If you simply want to document a specific comparison (e.g. single-threaded CPU code vs. parallelized GPU code) there is nothing wrong with publishing your actual measurement, as long as you document how it was done, especially if you provide both code bases for inspection.

If you make any claims about “the GPU is faster than the CPU by XX”, then IMO you are well-advised to compare only codes that do the same work and efficiently and effectively use the underlying architectures (for both CPU and GPU). For example in the CPU case you should certainly be using a multi-threaded code, so as to take advantage of the multiple CPU cores that most modern CPUs offer. These sorts of claims are likely to be viewed with skepticism anyway, so probably best to avoid them unless it is the crux of your intent. If that is your intent, the level of effort you’ve put into your two codes for comparison is almost certainly insufficient, based on your brief description.

Some of the biggest speed-ups might be obtained by running a large number of transcendental functions that would require many instructions on the CPU, but are implemented in a hardware unit (SFU) on the GPU.
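
For instance, the fast-math intrinsics such as __sinf(), __expf() and __logf() map to SFU instructions (at reduced accuracy compared to sinf()/expf()), whereas a CPU evaluates the same functions with multi-instruction library routines. A minimal, hypothetical kernel that is heavy in such operations might look like this:

```cpp
#include <cuda_runtime.h>

// Hypothetical example: a transcendental-heavy workload per element.
// __sinf()/__expf()/__logf() are hardware-accelerated approximations (SFU).
__global__ void transcendental(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __expf(__sinf(in[i])) + __logf(in[i] * in[i] + 1.0f);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));   // inputs left uninitialized;
    cudaMalloc((void **)&d_out, n * sizeof(float));   // this only illustrates the instruction mix
    transcendental<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```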

Christian

We get typical speed-ups of 5-10x for GPU implementations of computer vision algorithms, compared with multi-threaded CPU implementations. These algorithms are mainly memory-bandwidth bound. For more info see e.g. http://www.joanneum.at/uploads/tx_publicationlibrary/ines_paper_fassold_final.pdf

@txbob, as you mentioned, “as a proposed “theoretical value” or predictor of performance, if you know that your code is mostly memory bound, or mostly compute bound, then you can compare the peak numbers for these between the two specific architectures, and use that as a “theoretical best case” or predictor of optimal performance”, how do we calculate the peak numbers for the two architectures (my code is mostly compute bound)? Is there any law like Amdahl’s law or Gustafson’s law to calculate speed-up for a GPU? From what I have read, these laws are mostly used for multi-core CPUs. Is there a similar way to calculate ideal performance for a GPU?

The peak throughput of single precision floating point multiply-add (or double precision floating point multiply-add) can be computed in a fairly simple fashion for both modern GPUs and modern CPUs.

In general, it requires some knowledge of the architecture in each case. The calculations are something like this:

CPU:

CPU cores * vector width * 2 * clock rate

GPU:

GPU cores * 2 * clock rate

These concepts are discussed extensively elsewhere if you care to look. The factor of 2 accounts for a fused multiply-add counting as two floating-point operations. Once you have these two numbers, you can use their ratio as a first-order estimate of the performance ratio to expect on a compute-bound code, if it is efficiently and effectively implemented for both architectures.
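
As a worked example of that ratio, using the GPU numbers from the original post (384 CUDA cores at 1.03 GHz) and an assumed quad-core CPU with 8-wide single-precision vectors and FMA at 3.0 GHz (an assumption for illustration, not the actual CPU in question):

```cpp
#include <cstdio>

int main()
{
    // GPU numbers taken from the original post (GeForce GT 740M):
    // cores * 2 (FMA) * clock  ~= 791 GFLOP/s single precision.
    double gpu_flops = 384 * 2 * 1.03e9;

    // Assumed CPU for illustration: 4 cores, 8-wide SP vectors, FMA, 3.0 GHz:
    // cores * vector width * 2 (FMA) * clock  ~= 192 GFLOP/s single precision.
    double cpu_flops = 4 * 8 * 2 * 3.0e9;

    printf("first-order compute-bound speed-up estimate: %.1fx\n",
           gpu_flops / cpu_flops);
    return 0;
}
```

Under these assumptions the first-order estimate works out to roughly 4x, which is consistent with the rule-of-thumb figures mentioned earlier in this thread; a measured speed-up far beyond that on a compute-bound code would suggest the CPU implementation is not making full use of its cores and vector units.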

NVIDIA tends to be very conservative when they compare GPUs to CPUs. They will usually compare to an optimized multi-core CPU implementation, and usually these comparisons are for algorithms which often are not ‘ideal’ candidates for GPUs. Good examples of this would be sparse linear algebra subroutines or FFTs. In those cases you get about a 4-10x performance difference when compared to a good multi-core CPU implementation.

On the other hand, image processing algorithms and brute-force algorithms are much faster on GPUs than in multi-core CPU implementations.

I have a few brute-force algorithm implementations in which I compare a single (not overclocked) consumer GPU to a serial single-threaded CPU implementation (overclocked to 4.5 GHz):

https://github.com/OlegKonings/GPU_4Polygon

https://github.com/OlegKonings/CUDA_brute_triangle

All of those brute-force examples use a great number of 64-bit integer operations, which are much slower on consumer GPUs than on CPUs, yet the GPU still comes out ahead by a large margin. You can run the code and see for yourself…

For all of these you can make the generous assumption that if the implementation were running concurrently on all of the available cores (usually 4-6 for an i7 and as many as 16 for a Xeon) at the full clock speed, the CPU time could be divided by the number of participating cores.

But even in that case some of those brute-force algorithms outperform the single-threaded CPU implementation by as much as a factor of 500, and even if you were able to get 4 CPU cores going concurrently at 4.5 GHz, the difference would still be a factor of ~125.

For image processing, I can perform a 3D separable Gaussian convolution on an image volume of size 768x768x32 (15-element window) in 3-4 ms on a single GPU, while the same operation using three MATLAB convn() library calls takes over a second.
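
For reference, the reason only three convn() calls are needed is that a separable 3D Gaussian reduces to one 1D convolution pass along each axis. A naive, hypothetical sketch of a single row pass on the GPU (no shared-memory tiling, identity weights, just to show the structure; not the optimized code used for the timings above) could look like this:

```cpp
#include <cuda_runtime.h>

#define RADIUS 7   // 15-element window, as in the example above

// Filter weights in constant memory.
__constant__ float d_kernel[2 * RADIUS + 1];

// One pass of a separable convolution along the row (x) direction.
// A full 3D separable convolution applies a pass like this along x, y, and z.
__global__ void convolve_rows(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xk = min(max(x + k, 0), width - 1);   // clamp at the borders
        sum += in[y * width + xk] * d_kernel[k + RADIUS];
    }
    out[y * width + x] = sum;
}

int main()
{
    const int width = 768, height = 768;

    float h_kernel[2 * RADIUS + 1] = {0};   // Gaussian weights would go here;
    h_kernel[RADIUS] = 1.0f;                // identity kernel just for the sketch
    cudaMemcpyToSymbol(d_kernel, h_kernel, sizeof(h_kernel));

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  width * height * sizeof(float));
    cudaMalloc((void **)&d_out, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    convolve_rows<<<grid, block>>>(d_in, d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```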

The real challenge with optimizing algorithms is to determine which portions can be performed on the GPU, and which portions should be performed on the CPU.