Comparing CPU and GPU Theoretical GFLOPS
Does anyone know how NVIDIA computes the CPU GFLOPS for the GFLOPS graph at the start of Section 1 in the CUDA Programming Guide? Ivy Bridge performance is shown to be around 650 GFLOPS. However, looking at Intel documentation for E5-2600 series CPUs, GFLOPS is listed as under 200 (http://download.intel.com/support/processors/xeon/sb/xeon_E5-2600.pdf).

#1
Posted 05/21/2014 04:35 PM   
From what I can tell, the most powerful IvyBridge processors have 15 cores running at 2.8 GHz (Xeon E7-2890 v2) where each core can deliver 16 SP or 8 DP results per cycle. If you multiply out those numbers, you get a theoretical throughput of 672 GFLOPS SP, or 336 GFLOPS DP, which seems entirely consistent with the graphs in the Programming Guide.

#2
Posted 05/21/2014 05:37 PM   
The 650 number in that graph is a single-precision number. (The corresponding DP number is half as big.)

Ivy Bridge (and Sandy Bridge) support 256-bit AVX and can sustain 16 SP or 8 DP ops per clock per core:

http://www.nas.nasa.gov/hecc/support/kb/Ivy-Bridge-Processors_445.html

A high-end Ivy Bridge can have up to 12 cores (EP) or even 15 cores (EX), at clock frequencies up to 2.8GHz (EX) or 3.0GHz (EP):

http://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)

Therefore an Ivy Bridge EX processor with 15 cores, 16 SP flops/cycle, and 2.8GHz would hit a theoretical 672 GFlops/sec (SP)
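The arithmetic above can be sketched as a short Python calculation. A minimal sketch, using only the figures quoted in this thread (15 cores, 2.8 GHz, 16 SP or 8 DP results per core per cycle):

```python
# Peak theoretical throughput in GFLOPS:
# cores * clock (GHz) * floating-point results per core per cycle.
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

# Xeon E7-2890 v2 (Ivy Bridge EX): 15 cores at 2.8 GHz,
# 16 SP or 8 DP results per core per cycle with 256-bit AVX.
sp = peak_gflops(15, 2.8, 16)  # approx. 672 GFLOPS single precision
dp = peak_gflops(15, 2.8, 8)   # approx. 336 GFLOPS double precision
```

This matches the roughly 650 GFLOPS SP shown in the Programming Guide graph.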

#3
Posted 05/21/2014 05:39 PM   
And how is this measured on the GPU side?

Surely the serial-implementation versus parallel-implementation paradigm would bias the results in favour of the CPU?
To test accurately, you must let the CPU run serial and the GPU parallel, not so?

#4
Posted 05/21/2014 06:09 PM   
Thanks. I wonder why Intel reports significantly lower numbers in their export compliance documentation.

#5
Posted 05/21/2014 06:09 PM   
Perhaps pay attention to the precision (single versus double), the instruction type (add, multiply, or multiply-add), and whether SIMD is used, as stated in the different measures.

#6
Posted 05/21/2014 06:30 PM   
GPU theoretical flops calculation is similar conceptually. It will vary by GPU just as the CPU calculation varies by CPU architecture and model.

To use K40m as an example:

http://www.nvidia.com/content/PDF/kepler/Tesla-K40-PCIe-Passive-Board-Spec-BD-06902-001_v05.pdf

there are 15 SMs (2880/192), each with 64 DP ALUs that are capable of retiring one DP FMA instruction per cycle (== 2 DP Flops per cycle).

15 × 64 × 2 × 745 MHz = 1.43 TFLOPS

which is the stated perf:

http://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf

The change for SP is that there are 192 SP ALUs per SM instead of 64 in the DP case, which yields exactly a tripling of the DP perf for SP perf: 3 × 1.43 = 4.29 TFLOPS SP
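The GPU side reduces to the same kind of multiplication. A sketch of the K40 numbers quoted above (15 SMs, 64 DP units or 192 SP cores per SM, 745 MHz, FMA counted as 2 flops):

```python
# Peak GPU throughput in TFLOPS:
# SMs * units per SM * flops per unit per cycle (FMA = 2) * clock (GHz).
def gpu_peak_tflops(sms, units_per_sm, clock_ghz, flops_per_unit=2):
    return sms * units_per_sm * flops_per_unit * clock_ghz / 1000.0

# Tesla K40: 15 SMs, 64 DP units or 192 SP cores per SM, 745 MHz base clock.
dp = gpu_peak_tflops(15, 64, 0.745)   # approx. 1.43 TFLOPS DP
sp = gpu_peak_tflops(15, 192, 0.745)  # approx. 4.29 TFLOPS SP
```

Note these use the 745 MHz base clock; with GPU Boost clocks the theoretical numbers would be higher.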

#7
Posted 05/21/2014 07:02 PM   
txbob:

GPU architecture diagrams hardly show ALUs, just cores, along with FPUs and, I think, "int units"; so when you refer to ALUs, I should think of cores, right?

Also, are these units generally SIMT, or plain single instruction, do you know?

#8
Posted 05/21/2014 07:13 PM   
A GPU SM is a collection of various kinds of functional units managed by scheduler(s), roughly speaking. My sloppy terminology was not intended to be perfect from a semantic point of view, but merely to communicate a concept in the calculation of peak theoretical performance. The term "core", in my opinion, most closely translates to what I called an "SP ALU". I base this on page 8 of the GK110 white paper, which gives a reasonably good picture of the major functional units and organization, as well as (at least one definition of) the NVIDIA terminology used to reference them:

http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

txbob: GK110 whitepaper p8:
SP ALU -> Core
DP ALU -> DP Unit

In the absence of a grand poobah who declares truth, I think it's reasonable to use "Core" and "DP Unit", not my sloppy usage of the term "ALU". If you have other terms you'd like to use, such as FPUs, have at it.

#9
Posted 05/21/2014 07:34 PM   
SIMT (single instruction, multiple thread) refers to the idea that the scheduler, while processing what is effectively a single instruction stream, will marshal multiple functional units (in some fashion) to perform the activities of "multiple threads" "simultaneously". To use a specific example: 32 "Cores" (i.e. SP ALUs in txbob-speak) might be scheduled together to process a warp's activity associated with an SP FMA instruction coming from the (single) instruction stream.

For a better description, refer to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

#10
Posted 05/21/2014 07:46 PM   
I take no issue with your terminology, and you have provided concrete references, thanks.

SIMT is closely associated with/related to SIMD.
So, I take it that the SM ALUs or FPUs are generally SIMT/SIMD, or not?
Data sheets would normally stipulate some number of CUDA cores per SM for arithmetic instructions.
So, can one view the execution of arithmetic instructions by such cores from a SIMD viewpoint, or not (as one can view the execution of arithmetic by the CPU FPU as SIMD in certain cases, when it is used)?
In the above, perhaps SIMT/SIMD should merely be taken to mean a single instruction performed on multiple input/data/register sets.

#11
Posted 05/22/2014 04:52 AM   
Each single "CUDA core" (or FPU in non-Nvidia terminology) by itself is not SIMT or SIMD.
SIMT refers to the fact that a single instruction commands 32 (i.e. the warp size) "CUDA cores" to perform the same operation.

#12
Posted 05/22/2014 08:38 AM   
Intel responded to my request for more information on their export compliance documentation:
"GFLOPs is chip-wide/combined performance... All GFLOPS, CTP and APP calculations contained herein were based on specifications taken from Intel datasheets..."

Intel appears to have a more conservative method for determining theoretical peak GFLOPS. For example, they report that their E5-2690 has a peak performance of 185.6 GFLOPS. However, in order to make a fair apples-to-apples comparison between Intel CPUs and NVIDIA GPUs, the more optimistic method described by njuffa and txbob seems necessary (672 GFLOPS for the E5-2690).

#13
Posted 05/23/2014 11:47 PM   
My computation of 336/672 GFLOPS in #2 above was for the Xeon E7-2890 v2, which is an IvyBridge EX-class server CPU. The E5-2690 is a SandyBridge EP-class server CPU with 8 cores running at 2.9 GHz, where each core produces 8/16 results per cycle. Thus theoretical GFLOPS for this CPU are 186/371, and this would appear to match the number you quote from Intel documentation.

When making these comparisons, please look closely at exact processor designations including possible v2 and v3 suffixes.
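The mix-up is easy to check with the same per-core formula, using the specs stated above for the two chips in question:

```python
# Peak theoretical throughput in GFLOPS:
# cores * clock (GHz) * floating-point results per core per cycle.
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

# Xeon E5-2690 (Sandy Bridge EP): 8 cores at 2.9 GHz, 8 DP / 16 SP per cycle.
e5_dp = peak_gflops(8, 2.9, 8)    # approx. 185.6 GFLOPS, matching Intel's figure
e5_sp = peak_gflops(8, 2.9, 16)   # approx. 371.2 GFLOPS
# Xeon E7-2890 v2 (Ivy Bridge EX): 15 cores at 2.8 GHz, 8 DP / 16 SP per cycle.
e7_dp = peak_gflops(15, 2.8, 8)   # approx. 336 GFLOPS
e7_sp = peak_gflops(15, 2.8, 16)  # approx. 672 GFLOPS
```

So the 672 GFLOPS figure belongs to the E7-2890 v2, not the E5-2690.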

#14
Posted 05/24/2014 12:51 AM   
njuffa said:
When making these comparisons, please look closely at exact processor designations including possible v2 and v3 suffixes.

Cripes! As if Intel didn't make their naming scheme difficult enough! Thanks for pointing this out.

#15
Posted 05/24/2014 03:34 AM   