Intel paper: Debunking the 100X GPU vs. CPU myth
A somewhat controversial paper was presented at the ISCA conference this week:
[url="http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=94608761&CFTOKEN=50783980&ret=1#Fulltext"]Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU[/url], by Victor Lee et al. from Intel.

I think it may be an interesting read for the CUDA developer community... (and it has been a while since we last had a speedup measurement methodology debate :) )

The authors compare the performance of several parallel kernels on a Core i7 960 against a GTX 280, with the kernels highly tuned on both sides.
They measure very reasonable speed-ups, from 0.5x to 14x, with 2.5x on average.
The paper follows up by analyzing the causes of suboptimal performance on both sides and the implications for architecture design.

So here is the official PR answer from NV:
[url="http://blogs.nvidia.com/ntersect/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html"]http://blogs.nvidia.com/ntersect/2010/06/g...says-intel.html[/url]

Unlike the blog poster, I would not question the fairness of Intel's analysis. But he does have a point when he says that the real myth is that modern CPUs are easier to program than GPUs.
In this regard, it is interesting to note that Fermi's improvements are mostly on the programmability side, and not that much on raw performance...

Any thoughts about this?
#1
Posted 06/25/2010 12:20 PM   
From a high performance perspective I agree with this paper. My only gripe was the fact that it compared the theoretical scalar throughput of the CPU without SSE against the GPU with only one thread per warp active, both using multiple cores.

If you are willing to multi-thread your application, map it onto SIMD units, and carefully orchestrate memory traffic so that you hit the theoretical peak OP/s, then the speedup of your application will be commensurate with the difference in peak performance between the two architectures. The ratio of peak instructions/s, (~600 or ~900)/(~50 or ~100) = ~9-12x, or of peak bandwidth, ~140/~25 GB/s = ~5-6x, should be the speedup of your application if you spend an infinite amount of time optimizing each implementation. It should be ~5-12x on that GPU compared to that CPU, not 100x.
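To make that back-of-the-envelope bound concrete, here is a rough sketch (mine, not the paper's) that just divides approximate peak figures for a GTX 280 by those of a Core i7 960; the constants are ballpark values and depend on which instruction mix you assume (MAD only vs. MAD+MUL dual issue, SSE width, memory configuration):

[code]
/* Rough upper-bound estimate for the achievable GPU-over-CPU speedup,
 * dividing approximate peak figures for a GTX 280 by those of a Core i7 960.
 * The constants are ballpark values and depend on the assumed instruction
 * mix (MAD only vs. MAD+MUL dual issue, SSE width, memory configuration). */
#include <stdio.h>

int main(void)
{
    const double gpu_gflops = 933.0;  /* ~622 if you only count MAD issue   */
    const double cpu_gflops = 102.0;  /* ~51 with a narrower SSE assumption */
    const double gpu_gbps   = 141.0;  /* GTX 280 peak memory bandwidth      */
    const double cpu_gbps   = 25.0;   /* Core i7 960 peak memory bandwidth  */

    printf("compute-bound ceiling:   ~%.0fx\n", gpu_gflops / cpu_gflops);
    printf("bandwidth-bound ceiling: ~%.0fx\n", gpu_gbps / cpu_gbps);

    /* Whichever ceiling applies to your kernel is the most you can hope
     * for once both implementations are fully tuned: ~5-12x, not 100x. */
    return 0;
}
[/code]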

A more important takeaway from this paper was that getting close to the peak performance on both the CPU and the GPU required (see the CUDA sketch after the list):

1) Multi-threading to saturate all cores.
2) SIMD to exploit multiple FUs per core.
3) Tuning memory traffic to exploit the full bandwidth from all memory controllers.
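
Not the paper's code, just a minimal CUDA sketch of what those three points look like in practice: a grid-stride SAXPY where the launch saturates the SMs, each warp executes in lockstep on the SIMD units, and adjacent threads touch adjacent addresses so the memory traffic coalesces.

[code]
// Not the paper's code -- a minimal CUDA sketch of the three points above:
// (1) enough blocks and threads to saturate every SM,
// (2) warps executing the same instruction in lockstep on the SIMD units,
// (3) adjacent threads touching adjacent addresses so loads/stores coalesce.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Grid-stride loop: one launch covers any n and keeps all SMs busy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];   // coalesced, full-bandwidth traffic
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    saxpy<<<256, 256>>>(n, 2.0f, x, y);   // 65536 threads in flight
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
[/code]

On the CPU side the same three ingredients come from threads (pthreads/OpenMP), SSE intrinsics or auto-vectorization, and cache blocking, which is essentially what the paper describes for its tuned CPU kernels.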

For people on this forum, I would think that this paper makes CUDA more relevant as it addresses all of these issues.

EDIT: The claim of 100x speedup is really more of a comparison of a multi-threaded, SIMD implementation with regular memory accesses to a highly-tuned single-threaded, SISD implementation, with unstructured memory accesses, or perhaps an implementation in a language like CUDA versus an implementation in a language like C without SSE, without threads, and with an abundance of pointer chasing.
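
For illustration only (a made-up example, not anything from the paper or from those publications): the same reduction written as single-threaded, pointer-chasing C versus a contiguous array that a vectorizing compiler, or a GPU grid, can actually stream through.

[code]
/* Made-up illustration, not from the paper: the same reduction written as
 * single-threaded, pointer-chasing C versus a contiguous array that a
 * vectorizing compiler (or a GPU grid) can actually stream through. */
#include <stddef.h>
#include <stdio.h>

struct node { double value; struct node *next; };

/* Every load depends on the previous one: no SIMD, no useful prefetching,
 * latency-bound on each cache miss. */
double sum_list(const struct node *head)
{
    double s = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}

/* Independent, unit-stride accesses: vectorizable, bandwidth-bound,
 * and trivially mapped onto thousands of GPU threads. */
double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    double a[3] = { 1.0, 2.0, 3.0 };
    struct node n2 = { 3.0, NULL };
    struct node n1 = { 2.0, &n2 };
    struct node n0 = { 1.0, &n1 };
    printf("list: %g  array: %g\n", sum_list(&n0), sum_array(a, 3));
    return 0;
}
[/code]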
#2
Posted 06/25/2010 01:07 PM   
Section 5 of this paper (5.1: Platform Optimization Guide and 5.2: Hardware Recommendations) is a contribution to the discussion, and I believe it is something close to required reading for anyone who has the ambition to compare the performance of CPUs and GPUs:

[quote][W]e study how architectural features such as core complexity, cache/buffer design, and fixed function units impact [i]throughput[/i] computing workloads.[/quote]

However the rest of the paper is mostly Intel marketing:
[quote]In addition, this paper also presents a fair comparison between performance on CPUs and GPUs and dispels the myth that GPUs are 100x-1000x faster than CPUs for [i]throughput [/i]computing kernels.[/quote]
Handpicking benchmarks neither proves nor debunks anything and is no more objective than the Nvidia marketing response.
#3
Posted 06/25/2010 03:04 PM   
Wake me up again when there is an efficient CUDA compiler running on its own target architecture.
#4
Posted 06/25/2010 03:42 PM   
" My only gripe was the fact that it compared theoretical scalar throughput between the CPU without SSE against the GPU with only one thread per warp ctive, both using multiple cores.
"

Is he crazy?
" My only gripe was the fact that it compared theoretical scalar throughput between the CPU without SSE against the GPU with only one thread per warp ctive, both using multiple cores.

"



Is he crasy?

#5
Posted 06/25/2010 03:51 PM   
[quote name='Gregory Diamos' post='1077875' date='Jun 25 2010, 06:07 AM']From a high performance perspective I agree with this paper. My only gripe was the fact that it compared the theoretical scalar throughput of the CPU without SSE against the GPU with only one thread per warp active, both using multiple cores.[/quote]

Actually they seem to neglect the distinction between the warp size (32) and the "microarchitectural" SIMD width (8). So the quoted scalar throughput should be 4 times lower: there is just no way to issue more than 1 instruction per SM clock on GT200.
I believe scalar throughput is a meaningful metric for measuring how fast you can process scalar work such as address calculations and loop control.
Even in CUDA applications, I have measured that around 30% of all instructions actually do scalar work and could be replaced by scalar instructions (although that ratio shrinks a bit in highly-tuned apps). The GPU pays an overhead for performing address calculations and control flow inside SIMD units, and that should be factored into the comparison.
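
To make that concrete, here is a hypothetical kernel (a sketch I made up, not one of the benchmarks): the row index, the base-address arithmetic, and the loop control are uniform or nearly uniform across the warp, yet every lane re-executes them inside the SIMD units; only the final multiply is genuinely data-parallel.

[code]
// Hypothetical kernel, not a measurement: one block per row, threads
// striding across the columns. The marked instructions are "scalar work"
// in the sense above -- uniform (or nearly uniform) across the warp, yet
// re-executed by all 32 lanes -- while only the multiply is data-parallel.
#include <cuda_runtime.h>

__global__ void scale_rows(float *m, int rows, int cols, float a)
{
    int row = blockIdx.x;                  // uniform across the warp
    if (row >= rows)
        return;                            // uniform branch
    float *base = m + (size_t)row * cols;  // uniform address arithmetic

    for (int col = threadIdx.x;            // lane id plus uniform loop control
         col < cols;                       // same outcome for all full warps
         col += blockDim.x)
        base[col] *= a;                    // the actual data-parallel work
}

int main()
{
    const int rows = 256, cols = 1024;
    const size_t bytes = (size_t)rows * cols * sizeof(float);
    float *m;
    cudaMalloc((void **)&m, bytes);
    cudaMemset(m, 0, bytes);

    scale_rows<<<rows, 128>>>(m, rows, cols, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(m);
    return 0;
}
[/code]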

[quote]EDIT: The claim of 100x speedup is really more of a comparison of a multi-threaded, SIMD implementation with regular memory accesses to a highly-tuned single-threaded, SISD implementation, with unstructured memory accesses, or perhaps an implementation in a language like CUDA versus an implementation in a language like C without SSE, without threads, and with an abundance of pointer chasing.[/quote]

The 100x speedup might be relevant for a customer who has some legacy code they want to accelerate and must choose between:
- spending $x on faster CPUs, throwing more hardware at the problem (which may not even work if the implementation does not scale),
- spending $y on development to get a 20x speedup on their current CPU hardware,
- spending $z1 on development and $z2 on GPUs to get a 100x speedup.

At the end of the day, what matters to them is the development effort required to get a decent acceleration, not really the peak performance... Unless they can use some readily-available, highly-tuned libraries.

[quote name='Tom Milledge' post='1077913' date='Jun 25 2010, 08:04 AM']Handpicking benchmarks neither proves nor debunks anything and is no more objective than the Nvidia marketing response.[/quote]

Well, at least they wrote 3 pages explaining why they believe their benchmarks are meaningful and representative and what their workload characteristics are, instead of just dismissing the other position by handwaving... ;)
Which application(s) do you think they should have included that have characteristics not covered by their benchmarks?
#6
Posted 06/25/2010 04:15 PM   
[quote name='Sylvain Collange' post='1077941' date='Jun 25 2010, 04:15 PM']I believe scalar throughput is a meaningful metric for measuring how fast you can process scalar work such as address calculations and loop control.
Even in CUDA applications, I measured that around 30% of all instructions actually do scalar work and could be replaced by scalar instructions (although that ratio shrinks a bit in highly-tuned apps). The GPU suffers an overhead by performing address calculations and control inside SIMD units, and that should be taken into the balance for the comparison.[/quote]

That is the wrong approach. We need a number of warps active anyway, about 16, so we get 16*32 threads essentially for free. Since we need a minimum of around 384 active threads to load the GPU, with only 16 active threads it will take the same time as if it were running 384 anyway. That comparison is totally biased.
#7
Posted 06/25/2010 04:26 PM   
[quote name='Lev' post='1077929' date='Jun 25 2010, 04:51 PM']" My only gripe was the fact that it compared the theoretical scalar throughput of the CPU without SSE against the GPU with only one thread per warp active...

Is he crazy?[/quote]
Reading that back, I can see how what I wrote could be misinterpreted. What I meant was that in one table in the paper, they compared theoretical performance using one thread per core and no SSE on the CPU against one active SIMD unit on the GPU and called it scalar throughput, even though threads in CUDA are implicitly mapped onto SIMD units.

To clarify, it was only one entry in one table, not the main 0.5x to 14x result.
#8
Posted 06/25/2010 04:30 PM   
[quote name='Sylvain Collange' post='1077941' date='Jun 25 2010, 12:15 PM']Which application(s) do you think they should have included?[/quote]
The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting "a fair comparison between performance on CPUs and GPUs", they would do the same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise.
#9
Posted 06/25/2010 04:33 PM   
"The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting "a fair comparison between performance on CPUs and GPUs", they would do they same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise. "

That would be good. But there are not enough Intel engineers to optimize every application. Sometimes it is harder to optimize for the CPU, sometimes for the GPU.
"The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting "a fair comparison between performance on CPUs and GPUs", they would do they same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise. "



This will be good. But we have not enough Intel enginers to optimize every application. Some time it is harder to optimize for CPU, some times for GPU.

#10
Posted 06/25/2010 04:38 PM   
[quote name='Lev' post='1077954' date='Jun 25 2010, 12:38 PM']"The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting "a fair comparison between performance on CPUs and GPUs", they would do the same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise. "

That would be good. But there are not enough Intel engineers to optimize every application. Sometimes it is harder to optimize for the CPU, sometimes for the GPU.[/quote]
Agreed. My point (and this was about the marketing aspect of the Intel paper) was that they were not so much "debunking" 100x-plus speed-up claims as ignoring them. Too bad the TV show "MythBusters" doesn't cover HPC topics. :)
#11
Posted 06/25/2010 04:52 PM   
[quote]...versus an implementation in a language like C without SSE, without threads, and with an abundance of pointer chasing.[/quote]In my experience, that would be a positively [i]glowing[/i] endorsement of a lot of scientific codes.
#12
Posted 06/25/2010 04:52 PM   
[quote name='Tom Milledge' post='1077959' date='Jun 25 2010, 09:52 AM']Too bad the TV show "MythBusters" doesn't cover HPC topics. :)[/quote]

Well, they did cover GPU computing: :)
[url="http://www.nvidia.com/object/nvision08_gpu_v_cpu.html"]http://www.nvidia.com/object/nvision08_gpu_v_cpu.html[/url]
#13
Posted 06/25/2010 07:57 PM   
Great opening line on the NVIDIA response:

"It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs"


I'd say it's probably worth pointing out that the i7-960 they were using costs, what, double what the GTX 280 costs now? I do definitely agree, though (and it's been discussed here before), that a lot of the speed-up numbers quoted are pretty bogus. One that claimed a "100x speedup" by using 4 Teslas with a combined cost far in excess of the CPU they were comparing to stands out in my mind. It's pretty much impossible to give a real apples-to-apples speed comparison because of disparities in the effort needed to program, hardware cost, lack of portability, etc., so all we're left with is real-world results of people getting their work done faster.

I personally prefer to say that for very little cost, my runtimes went from 8 hours to 20 minutes, instead of quoting "25x" or something of that manner.
#14
Posted 06/25/2010 10:31 PM   
Or, for me, quoting that I have upgraded a 3-year-old $300 PC for less than $100 and now have an application that is faster than on a $1000 PC?
(and actually, since the app runs on the GPU, the PC stays totally responsive while the app gobbles up the CUDA GPU resources)
#15
Posted 06/28/2010 02:48 PM   