CUDA 7.0 Jetson TX1 performance and benchmarks

Has anyone run benchmarks on the TX1? I got a glmark2 score of 818 on my Shield TV.

simpleMultiCopy produced poorer performance than on the TK1:

[simpleMultiCopy] - Starting…

Using CUDA device [0]: GM20B
[GM20B] has 2 MP(s) x 128 (Cores/MP) = 256 (Cores)
Device name: GM20B
CUDA Capability 5.3 hardware with 2 multi-processors
scale_factor = 1.00
array_size = 4194304

Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property “deviceOverlap”)
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 25.526670 ms
Compute can overlap with one transfer: 19.573042 ms
Compute can overlap with both data transfers: 15.620518 ms

Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 9.440632 ms
Avg. time when overlapped using 4 streams : 5.101471 ms
Avg. speedup gained (serialized - overlapped) : 4.339161 ms

Measured throughput:
Fully serialized execution : 3.554257 GB/s
Overlapped using 4 streams : 6.577403 GB/s

For comparison, the following results are from a Tegra K1 (Chromebook CB5):

[simpleMultiCopy] - Starting…
modprobe: FATAL: Module nvidia not found.

Using CUDA device [0]: GK20A
[GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
Device name: GK20A
CUDA Capability 3.2 hardware with 1 multi-processors
scale_factor = 1.00
array_size = 4194304

Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property “deviceOverlap”)
( ) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
Memcpy host to device : 1.233408 ms (13.602325 GB/s)
Memcpy device to host : 1.231520 ms (13.623177 GB/s)
Kernel : 2.142368 ms (78.311548 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 4.607296 ms
Compute can overlap with one transfer: 2.464928 ms
Compute can overlap with both data transfers: 2.142368 ms

Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 5.033206 ms
Avg. time when overlapped using 4 streams : 4.325859 ms
Avg. speedup gained (serialized - overlapped) : 0.707348 ms

Measured throughput:
Fully serialized execution : 6.666611 GB/s
Overlapped using 4 streams : 7.756709 GB/s
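
For anyone wondering what simpleMultiCopy actually exercises: it splits the work across several CUDA streams so that host-to-device copies, the kernel, and device-to-host copies can overlap when the hardware supports it (the "deviceOverlap" and dual-copy-engine properties in the output above). Roughly, the pattern looks like the sketch below. This is only an illustration with a placeholder kernel and sizes, not the actual sample source:

// Simplified copy/kernel overlap sketch in the spirit of simpleMultiCopy.
// Illustration only: placeholder kernel and sizes, not the sample source.
// build (assumption): nvcc -o overlap overlap.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void incKernel(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main()
{
    const int N = 1 << 22;          // ~4M elements, similar to array_size above
    const int nStreams = 4;
    const int chunk = N / nStreams;

    int *h, *d;
    cudaMallocHost((void **)&h, N * sizeof(int));  // pinned memory, required for async copies
    cudaMalloc((void **)&d, N * sizeof(int));
    for (int i = 0; i < N; ++i) h[i] = i;

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    // Each stream copies its chunk in, runs the kernel on it, and copies it back.
    // With copy/compute overlap, the copies in one stream hide behind kernels in others.
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(int), cudaMemcpyHostToDevice, s[i]);
        incKernel<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(int), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    printf("done\n");
    return 0;
}

On a device that can overlap both copy directions with compute, as the TX1 output above reports, the 4-stream case approaches the "compute can overlap with both data transfers" limit, which is where the measured speedup over the serialized run comes from.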

Can you try running the ‘max perf script’ listed below on TX1 before benchmarking?

#!/bin/sh

# turn on fan for safety
echo "Enabling fan for safety..."
if [ ! -w /sys/kernel/debug/tegra_fan/target_pwm ] ; then
	echo "Cannot set fan -- exiting..."
	exit 1
fi
echo 255 > /sys/kernel/debug/tegra_fan/target_pwm

echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo 1 > /sys/kernel/cluster/immediate
echo 1 > /sys/kernel/cluster/force
echo G > /sys/kernel/cluster/active
echo "Cluster: `cat /sys/kernel/cluster/active`"

# online all CPUs - ignore errors for already-online units
echo "onlining CPUs: ignore errors..."
for i in 0 1 2 3 ; do
	echo 1 > /sys/devices/system/cpu/cpu${i}/online
done
echo "Online CPUs: `cat /sys/devices/system/cpu/online`"

# set CPUs to max freq (perf governor not enabled on L4T yet)
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cpumax=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies | awk '{print $NF}'`
echo "${cpumax}" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
for i in 0 1 2 3 ; do
	echo "CPU${i}: `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq`"
done

# max GPU clock (should read from debugfs)
cat /sys/kernel/debug/clock/gbus/max > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
echo "GPU: `cat /sys/kernel/debug/clock/gbus/rate`"

# max EMC clock (should read from debugfs)
cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate
echo 1 > /sys/kernel/debug/clock/override.emc/state
echo "EMC: `cat /sys/kernel/debug/clock/emc/rate`"

Also please see these posts related to the script:

https://devtalk.nvidia.com/default/topic/894945/jetson-embedded-systems/jetson-tx1/post/4740508/#4740508
https://devtalk.nvidia.com/default/topic/894945/jetson-embedded-systems/jetson-tx1/post/4737266/#4737266

Thanks!

The fan did turn on, and I got a significant performance improvement:

Measured timings (throughput):
Memcpy host to device : 1.670526 ms (10.043074 GB/s)
Memcpy device to host : 1.709841 ms (9.812150 GB/s)
Kernel : 2.699841 ms (62.141496 GB/s)

versus before running the script:

Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)

but I got some errors when running the script:

ubuntu@tegra-ubuntu:~/x1$ sudo ./maxPerf.sh
Enabling fan for safety…
./maxPerf.sh: 11: ./maxPerf.sh: cannot create /sys/kernel/cluster/immediate: Directory nonexistent
./maxPerf.sh: 12: ./maxPerf.sh: cannot create /sys/kernel/cluster/force: Directory nonexistent
./maxPerf.sh: 13: ./maxPerf.sh: cannot create /sys/kernel/cluster/active: Directory nonexistent
cat: /sys/kernel/cluster/active: No such file or directory
Cluster:
onlining CPUs: ignore errors…
./maxPerf.sh: 19: ./maxPerf.sh: cannot create /sys/devices/system/cpu/cpu0/online: Directory nonexistent
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
Online CPUs: 0-3
CPU0: 2014500
CPU1: 2014500
CPU2: 2014500
CPU3: 2014500
GPU: 998400000
EMC: 1600000000

Those errors from the script are harmless (all the CPU cores should already be online, etc.), but you can keep an eye on tegrastats to double-check:

ubuntu@tegra-ubuntu:$ ~/tegrastats
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [4%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912
RAM 129/3854MB (lfb 781x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,0%,0%,0%]@102 EMC 5%@40 AVP 3%@80 VDE 0 GR3D 0%@38 EDP limit 1912

Here is what the acronyms in the tegrastats output stand for:

EMC – memory controller
AVP – audio/video processor
VDE – video decoder engine
GR3D – GPU

Where can I find “tegrastats”?

I did get 309 GFLOPS from the “nbody” sample on the Tegra X1, i.e., about twice the GFLOPS of the Tegra K1.

Huh? I get 259 GFLOPS from the K1 with “nbody -benchmark” (also on a CB5).

Good catch!

I followed the link above from dusty_nv to this link:

http://www.slothparadise.com/how-to-install-cuda-on-nvidia-jetson-tx1/

which in turn refers to a link reporting 157 GFLOPS for the TK1.

I just tried it and got 259 GFLOPS from the TK1-based Chromebook CB5.

It seems the TX1 still needs optimizations.

I re-ran “nbody -benchmark -numbodies=65536” on both the Shield TV (X1) and the CB5 (K1) and got:

X1: 315 GFLOPS

K1: 311 GFLOPS

What is missing for X1?

From what I’ve read, the Chromebook uses different memory with much higher bandwidth than the regular Tegra K1. That may well account for the performance difference, as even the nbody problem could be memory-bound.

I couldn’t find a source regarding the memory with quick googling, though.

Anyhow, ~157 GFLOPS for nbody is pretty standard on the Jetson TK1.

313 GFLOPS on my machine.

Is your “machine” a Jetson TX1? I do not have a TX1; I used a Shield TV and was worried that the Shield TV’s SDRAM might be too small (3 GB) and/or too slow.

For both TK1 and TX1, the CPU/GPU clocks must be maximized before running the benchmark tests, as shown in links in multiple places.

It seems those clock-maximizing scripts need to be run again after every power-up.

Yes, Jetson TX1.

I’m glad the $199 Shield TV with only 3 GB of RAM gets the same performance as the TX1 in this test.

I did notice that the “boxFilter” sample ran much faster on the X1 than on the K1, though.

Oh well, nbody might not be the best generic benchmark then. Let’s try something else:

~/6.5_Samples/0_Simple/matrixMulCUBLAS$ ./matrixMulCUBLAS

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GK20A" with compute capability 3.2

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Computing result using CUBLAS...done.
Performance= 223.12 GFlop/s, Time= 0.587 msec, Size= 131072000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

(That’s still on a CB5, apparently with a wider memory bus than the Jetson.)
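
By the way, the GFLOP/s figure in that output is just the GEMM operation count divided by the measured time: 2 * 320 * 640 * 320 = 131,072,000, which matches the “Size” field. A minimal cuBLAS timing sketch along the same lines is below. It is my own illustration, not the sample source; the matrices are left as zeros since only timing matters here, and the real sample averages over several runs:

// Minimal cuBLAS SGEMM timing sketch (illustration only, not matrixMulCUBLAS itself).
// build (assumption): nvcc -o gemmtime gemmtime.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int m = 320, k = 320, n = 640;   // roughly matches the matrix sizes above
    float *A, *B, *C;
    cudaMalloc((void **)&A, (size_t)m * k * sizeof(float));
    cudaMalloc((void **)&B, (size_t)k * n * sizeof(float));
    cudaMalloc((void **)&C, (size_t)m * n * sizeof(float));
    cudaMemset(A, 0, (size_t)m * k * sizeof(float));
    cudaMemset(B, 0, (size_t)k * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time cuBLAS start-up cost doesn't land in the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, A, m, B, k, &beta, C, m);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // C = alpha * A * B + beta * C  (column-major, no transpose)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, A, m, B, k, &beta, C, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gflops = 2.0 * m * n * k / (ms * 1e-3) / 1e9;   // 2*m*n*k flops per GEMM
    printf("%.2f GFlop/s (%.3f ms)\n", gflops, ms);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}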

For “./matrixMulCUBLAS”:

My Shield TV X1 showed only 153 GFLOPS.
My CB5 K1 showed 207 GFLOPS.

On the other hand, for “./boxFilter -benchmark”:

My Shield TV X1 showed 410 M RGBA Pixels/s.
My CB5 K1 showed only 37 M RGBA Pixels/s.

I compiled Blender and tested the BMW scene in Cycles.

It can render it in 9:48 (BVH building alone on the CPU takes 55 seconds, and post-processing takes about 20 seconds).

A high-end desktop card can do this in under 30 seconds; there, both BVH building and post-processing take 1-2 seconds.

http://www.pasteall.org/pic/show.php?id=96308

I ran convolutionFFT2D; the results:

K1: 114 MPix/s

X1: 250 MPix/s
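
For context, convolutionFFT2D does the convolution in the frequency domain: a forward FFT of the image and of the (padded) convolution kernel, a pointwise complex multiply, then an inverse FFT, with throughput reported in processed megapixels per second. A stripped-down sketch of that pipeline with cuFFT is below. It is my own illustration with placeholder sizes and zeroed inputs, not the sample source:

// Stripped-down FFT-convolution pipeline sketch using cuFFT.
// Illustration only: placeholder sizes, zeroed inputs, no timing.
// build (assumption): nvcc -o fftconv fftconv.cu -lcufft
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

// Pointwise complex multiply: a[i] *= b[i], scaled by 1/(W*H) for the un-normalized inverse FFT.
__global__ void pointwiseMul(cufftComplex *a, const cufftComplex *b, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex x = a[i], y = b[i];
    a[i].x = (x.x * y.x - x.y * y.y) * scale;
    a[i].y = (x.x * y.y + x.y * y.x) * scale;
}

int main()
{
    const int W = 1024, H = 1024;           // placeholder image size
    const int nComplex = H * (W / 2 + 1);   // size of the R2C transform output

    float *dImg, *dKer;
    cufftComplex *fImg, *fKer;
    cudaMalloc((void **)&dImg, (size_t)W * H * sizeof(float));
    cudaMalloc((void **)&dKer, (size_t)W * H * sizeof(float));
    cudaMalloc((void **)&fImg, nComplex * sizeof(cufftComplex));
    cudaMalloc((void **)&fKer, nComplex * sizeof(cufftComplex));
    cudaMemset(dImg, 0, (size_t)W * H * sizeof(float));
    cudaMemset(dKer, 0, (size_t)W * H * sizeof(float));

    cufftHandle fwd, inv;
    cufftPlan2d(&fwd, H, W, CUFFT_R2C);
    cufftPlan2d(&inv, H, W, CUFFT_C2R);

    // Forward transforms, multiply in frequency space, inverse transform back to pixels.
    cufftExecR2C(fwd, dImg, fImg);
    cufftExecR2C(fwd, dKer, fKer);
    pointwiseMul<<<(nComplex + 255) / 256, 256>>>(fImg, fKer, nComplex, 1.0f / (W * H));
    cufftExecC2R(inv, fImg, dImg);
    cudaDeviceSynchronize();
    printf("done\n");

    cufftDestroy(fwd);
    cufftDestroy(inv);
    cudaFree(dImg); cudaFree(dKer); cudaFree(fImg); cudaFree(fKer);
    return 0;
}

The MPix/s figures above are basically how many image pixels per second make it through that whole round trip.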