Has anyone run benchmarks on the TX1? I got a glmark2 score of 818 on my Shield TV.
simpleMultiCopy on the TX1 produced poorer performance than on the TK1:
[simpleMultiCopy] - Starting...
Using CUDA device [0]: GM20B
[GM20B] has 2 MP(s) x 128 (Cores/MP) = 256 (Cores)
Device name: GM20B
CUDA Capability 5.3 hardware with 2 multi-processors
scale_factor = 1.00
array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)
Measured timings (throughput):
Memcpy host to device : 15.620518 ms (1.074050 GB/s)
Memcpy device to host : 3.952524 ms (4.244684 GB/s)
Kernel : 5.953629 ms (28.179814 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 25.526670 ms
Compute can overlap with one transfer: 19.573042 ms
Compute can overlap with both data transfers: 15.620518 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 9.440632 ms
Avg. time when overlapped using 4 streams : 5.101471 ms
Avg. speedup gained (serialized - overlapped) : 4.339161 ms
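For what it's worth, the "theoretical limits" lines above appear to be simple combinations of the three measured times: fully serialized is the sum, overlap with one transfer is max(transfers, kernel), and overlap with both is the max of all three. A quick sanity check against the TX1 numbers (my reading of the sample's math, not copied from its source):

```python
# Measured times from the TX1 (GM20B) run, in ms
h2d, kernel, d2h = 15.620518, 5.953629, 3.952524

# Fully serialized: transfer, then kernel, then transfer back
no_overlap = h2d + kernel + d2h            # ~25.53 ms

# Kernel hidden behind one transfer direction at a time
one_overlap = max(h2d + d2h, kernel)       # ~19.57 ms

# Everything pipelined: bounded by the slowest single stage
both_overlap = max(h2d, d2h, kernel)       # ~15.62 ms

print(no_overlap, one_overlap, both_overlap)
```

All three match the printed limits, so on the TX1 the overlapped case is entirely bound by that slow 15.6 ms host-to-device copy.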
The following results were from Tegra K1 (Chromebook CB5):
[simpleMultiCopy] - Starting...
modprobe: FATAL: Module nvidia not found.
Using CUDA device [0]: GK20A
[GK20A] has 1 MP(s) x 192 (Cores/MP) = 192 (Cores)
Device name: GK20A
CUDA Capability 3.2 hardware with 1 multi-processors
scale_factor = 1.00
array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
( ) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)
Measured timings (throughput):
Memcpy host to device : 1.233408 ms (13.602325 GB/s)
Memcpy device to host : 1.231520 ms (13.623177 GB/s)
Kernel : 2.142368 ms (78.311548 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 4.607296 ms
Compute can overlap with one transfer: 2.464928 ms
Compute can overlap with both data transfers: 2.142368 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 5.033206 ms
Avg. time when overlapped using 4 streams : 4.325859 ms
Avg. speedup gained (serialized - overlapped) : 0.707348 ms
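The striking difference is in the copy bandwidth. Assuming 4-byte elements (which reproduces the GB/s figures the sample prints for array_size = 4194304), the host-to-device copy is roughly 13x faster on the TK1 Chromebook than on my TX1:

```python
bytes_copied = 4194304 * 4  # array_size elements x 4 bytes, per the sample output

def gbps(ms):
    # Convert a copy time in milliseconds to throughput in GB/s
    return bytes_copied / (ms * 1e-3) / 1e9

# Host-to-device copy: TX1 (GM20B) vs TK1 (GK20A)
print(gbps(15.620518))  # ~1.07 GB/s on the TX1
print(gbps(1.233408))   # ~13.6 GB/s on the TK1 Chromebook
```

1 GB/s for a host-to-device copy on the TX1 looks pathological for a unified-memory SoC, so something (pinning, clocks, power state?) may be off on my Shield TV rather than the hardware itself.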
From what I've read, the Chromebook has a different memory configuration with much higher bandwidth than the regular Tegra K1, which may well account for the performance difference, since even the n-body problem can be memory bound.
I couldn't find a source on the memory specs with a quick search.
Anyhow, ~157 GFLOP/s for n-body is pretty standard on the Jetson TK1.