FLOPs for Jetson host

highendcompute · February 22, 2015, 3:16pm

Hi - can’t seem to find the FLOPS/sec per core for the Cortex-A15 Jetson host processor… anybody care to send me a relevant URL? Ta, M

Nicholas_762 · February 22, 2015, 5:04pm

…

highendcompute · February 22, 2015, 8:58pm

Hi - I’m looking for the max performance of the chip (rather than what a program achieves), so I can see how well my own implementation is doing. Many thanks, M

kulve · February 23, 2015, 10:19am

I guess FLOPS usually go with the frequency and should be quite easy to calculate?

And with Tegra K1, the GPU can do a lot more FLOPS than the CPU.

In any case, do remember to lock the CPU/GPU/EMC clocks when doing any performance measurements:
http://elinux.org/Jetson/Performance
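As a quick sanity check before timing anything, the CPU governor and current frequency can be read back from sysfs. The sketch below is only illustrative: it assumes the standard Linux cpufreq layout under /sys/devices/system/cpu, which may differ on a given Jetson image, and the helper names `read_cpufreq` and `clocks_look_locked` are made up for this example. It does not cover the GPU or EMC clocks.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; the exact layout on a given
# Jetson image is an assumption and may differ.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq(cpu: int, root: Path = CPUFREQ_ROOT) -> dict:
    """Return the governor and current frequency (kHz) for one CPU core."""
    base = root / f"cpu{cpu}" / "cpufreq"
    governor = (base / "scaling_governor").read_text().strip()
    cur_khz = int((base / "scaling_cur_freq").read_text().strip())
    return {"cpu": cpu, "governor": governor, "cur_khz": cur_khz}


def clocks_look_locked(cpu: int, root: Path = CPUFREQ_ROOT) -> bool:
    """True if the governor is 'performance' (necessary, not sufficient)."""
    return read_cpufreq(cpu, root)["governor"] == "performance"


if __name__ == "__main__":
    try:
        print(read_cpufreq(0))
    except OSError as exc:
        # Not every machine exposes cpufreq (e.g. some VMs or containers).
        print(f"cpufreq not readable on this machine: {exc}")
```

The `root` parameter exists only so the helpers can be pointed at a different directory, for instance when testing off-device.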

highendcompute · February 24, 2015, 9:36pm

I believe the Cortex-A15 can do 3 instructions per cycle (see http://www.quora.com/What-is-so-great-about-ARMs-Cortex-A-15). Let’s pretend each instruction is one floating-point operation, and that each core of the A15 runs at a frequency of 2.3 GHz (see the TK1 entry of the ARM Cortex-A15 Wikipedia page). This gives a per-core max figure of 3 × 2.3 = 6.9 GFLOPS.
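The back-of-envelope estimate above can be written down as a small sketch. It only encodes the assumptions stated in the post (3 instructions per cycle, 2.3 GHz, one floating-point operation per instruction); the function name `peak_gflops` and the `flops_per_instr` parameter are hypothetical, and the result is a rough upper bound rather than a vendor figure.

```python
def peak_gflops(instr_per_cycle: float, clock_ghz: float,
                flops_per_instr: float = 1.0) -> float:
    """Rough per-core peak: instructions/cycle x GHz x FLOPs/instruction.

    flops_per_instr=1.0 mirrors the simplifying assumption in the post
    that every issued instruction is one floating-point operation.
    """
    return instr_per_cycle * clock_ghz * flops_per_instr


if __name__ == "__main__":
    per_core = peak_gflops(instr_per_cycle=3, clock_ghz=2.3)
    print(f"estimated per-core peak: {per_core:.1f} GFLOPS")
```

Changing `flops_per_instr` is the obvious knob if one wants to account for instructions that do more (or less) than one floating-point operation each.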

I hadn’t seen that elinux Jetson Performance page before, but it looks very useful, so I’ll see if my practical applications achieve anything like 6.9 GFLOPS per core.

highendcompute · February 24, 2015, 10:18pm

Tried practically: nbody for fp64, single core set to ‘performance’, and the nbody benchmark gave about 0.1 GFLOPS.

Nicholas_762 · February 25, 2015, 12:44am

…

highendcompute · March 14, 2015, 11:51am

Hi again, Yes Nicholas_762, totally agree but am looking for a figure for the theoretical peak.
For ref, cblas_dgemm from the standard BLAS/LAPACK in the Ubuntu repositories seems to max out at 0.377 GFLOPS, so better than nbody (as expected) but a factor of 18 lower than my derived max of 6.9 GFLOPS.
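For anyone repeating the comparison, here is a minimal sketch of the arithmetic involved. It assumes the conventional 2·m·n·k operation count for a dgemm of an m×k matrix by a k×n matrix; the helper names are hypothetical, and the only numbers plugged in are the two figures quoted in this thread (0.377 measured, 6.9 estimated peak).

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Conventional FLOP count for C = A @ B, A is m x k, B is k x n.

    Each of the m*n outputs needs k multiplies and k adds (the usual
    2*m*n*k convention; lower-order terms are ignored).
    """
    return 2 * m * n * k


def achieved_gflops(flops: float, seconds: float) -> float:
    """Convert an operation count and a wall-clock time to GFLOPS."""
    return flops / seconds / 1e9


def fraction_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Measured rate as a fraction of the (estimated) peak."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    measured, peak = 0.377, 6.9  # figures quoted in the thread
    frac = fraction_of_peak(measured, peak)
    print(f"fraction of estimated peak: {frac:.1%}")
    print(f"shortfall factor: {peak / measured:.1f}x")
```

In use, one would time a single dgemm call of known dimensions and pass the elapsed seconds to `achieved_gflops(dgemm_flops(m, n, k), seconds)`.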

I’d expect BLAS3 to be highly optimised but will try ATLAS (auto-tuning) to see if there is more to eke out.

Yours, highendcompute

kulve · March 14, 2015, 2:22pm

Just to emphasize the difference between CPU and GPU:

http://www.pugetsystems.com/blog/2014/05/23/NVIDIA-Jetson-TK1-CUDA-performance-569/

System              GFLOPS (single precision)
Jetson TK           157.592
Jetson ARM CPU      0.076

highendcompute · April 18, 2015, 8:03pm

Hi - the ‘puget’ timings are for nbody but, as I noted above, they won’t get near peak performance - see my 0.377 DP GFLOPS for a single core of the Cortex running dgemm versus the 0.076 SP GFLOPS quoted for the nbody example.

It’s very interesting to then work out how to measure GFLOPS/Watt for a given component of the Tegra - I’ve a nascent blog on this… URL to follow when a few more timings are complete.

Yours,

ShervinE · April 23, 2015, 10:17pm

Hi Michael,

Each ARM Cortex-A15 CPU core, such as the 4 of them in Tegra K1 (and Tegra 4 before it), can decode 3 assembly instructions per CPU clock and dispatch to 7 possible execution pipelines, while using out-of-order multi-issue with speculation (think of it as a code optimizer but inside the CPU hardware!). This is why you often see real applications run 1.5x to 2x as fast on Cortex-A15 as on Cortex-A9 CPUs.

So theoretically, due to the 7 execution pipelines, you could even say that 7 instructions can be executing in each of the 4 CPU cores at the same time, giving a potential for 28 instructions to execute for each CPU clock cycle. But it is EXTREMELY unlikely to ever obtain that in actual code (since you’d need to be doing the perfect combination of multiplies, adds, boolean logic, etc). So due to the 3 instruction decoders it would be more realistic to say the theoretical max is 3 instructions per cycle per core.

But note that each ARM Cortex-A15 core has a NEON SIMD 128-bit vector unit, and so a single “instruction” can be something like “vadd q0, q1, q2” that does 16 additions of 8-bit values (e.g. pixels) in one clock cycle! But then you need to consider that the CPU pipeline can’t handle as many NEON instructions as simpler instructions, so it doesn’t quite scale perfectly. And note that in Cortex-A15, floating-point NEON instructions are almost as fast as 32-bit integer NEON instructions (but not quite as fast), whereas 64-bit floats are much slower.
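To make the lane arithmetic concrete, here is a small sketch. The 128-bit register width and the 16 × 8-bit example come from the post above; the function names and the `simd_instr_per_cycle` parameter are hypothetical, and the real sustained NEON issue rate on the A15 is not stated in this thread, so the caller has to supply it as an assumption.

```python
NEON_REGISTER_BITS = 128  # width of a NEON Q register, as described above


def lanes(element_bits: int, register_bits: int = NEON_REGISTER_BITS) -> int:
    """Number of elements of a given width that fit in one vector register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


def simd_peak_gops(clock_ghz: float, element_bits: int,
                   simd_instr_per_cycle: float) -> float:
    """Rough per-core bound: GHz x lanes x SIMD instructions per cycle.

    simd_instr_per_cycle is an assumption supplied by the caller, not a
    documented figure for any particular core.
    """
    return clock_ghz * lanes(element_bits) * simd_instr_per_cycle


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(f"{bits:>2}-bit elements: {lanes(bits)} lanes per 128-bit register")
```

With 32-bit elements (for example single-precision floats) this gives 4 lanes per instruction, which is where a SIMD-aware peak estimate would start to differ from a plain instructions-per-cycle count.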

For actual measured results, I wrote the NEON perf benchmarks we use internally at NVIDIA and have seen very impressive results with quad-core NEON on Tegra 4 / Tegra K1. I’m probably not allowed to post the actual results publicly, but let’s just say that it performs very well if you write NEON assembly code well and don’t use any memory accesses!