TK1 Memory Bandwidth

I was using STREAM to test the memory bandwidth of the TK1 board and got lower than expected results. STREAM reported a maximum bandwidth of ~5.5 GB/s. I was expecting something closer to 14 GB/s.

933 MHz * 2 (transfers per clock for DDR) * 64 (interface width, bits) / 8 (bits per byte) = 14.928 GB/s

Am I calculating something incorrectly?

Thanks for any guidance.

Doesn’t STREAM measure “reading → op → writing”? If so, then double the result to get a number that’s close to the theoretical memory bandwidth.

If you run the ‘bandwidthTest’ CUDA Sample you’ll get ~12760 MB/sec.

STREAM does a few different tests:

Copy:  a = b
Scale: a = scale * b
Add:   a = b + c
Triad: a = b + scale * c

The results it presents already count the two or three arrays each operation touches, so the read and write traffic is included.
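
For reference, the Triad kernel and its byte accounting look roughly like this. This is a simplified single-threaded sketch, not the actual STREAM source; the array length and scale factor are placeholders.

/* Rough sketch of the STREAM Triad kernel and its byte accounting
   (not the actual STREAM source; N and scale are placeholders). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* 20M doubles per array, ~160 MB each */

int main(void)
{
    double *a = (double *)malloc(N * sizeof(double));
    double *b = (double *)malloc(N * sizeof(double));
    double *c = (double *)malloc(N * sizeof(double));
    const double scale = 3.0;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scale * c[i];          /* Triad: 2 reads + 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* All three arrays move through memory, so the reported bandwidth
       already includes both the read and the write traffic. */
    printf("Triad: %.2f GB/s (check value: %.1f)\n",
           3.0 * N * sizeof(double) / sec / 1e9, a[1]);

    free(a); free(b); free(c);
    return 0;
}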

I wasn't aware of the CUDA bandwidth test. I'll check that out and see if I get different results. Perhaps it's just a poorly optimized implementation in STREAM?

A quick skim of published STREAM benchmark results shows numbers that appear to be about 50% of peak memory bandwidth, at least for the processors I'm familiar with.

I got the CUDA bandwidthTest running. Results are below. I'll admit upfront that I am new to CUDA, but the concept of host and device is not quite clear to me on this SoC. I understand that you could have different bandwidths when the GPU is hanging off a PCIe bus, but the GPU has direct access to the memory on the TK1, right? Am I wrong in thinking all these bandwidth numbers should be comparable?

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1030.7

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5368.6

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11422.9

You are correct.

The bandwidthTest CUDA Sample should've been rewritten for the TK1 to use the Managed Memory API instead of performing pointless copies. H->D and D->H transfers should effectively be no-ops in a real application.
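
What I have in mind is something along these lines; a rough sketch of the managed-memory approach, not the actual bandwidthTest code, with a placeholder kernel and sizes:

// Hypothetical sketch: share one allocation between CPU and GPU on TK1
// via cudaMallocManaged instead of copying between host and device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data = NULL;
    cudaMallocManaged(&data, n * sizeof(float));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // CPU writes directly, no cudaMemcpy

    scaleKernel<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                       // hand the data back before the CPU touches it

    printf("data[0] = %f\n", data[0]);             // CPU reads directly, no cudaMemcpy
    cudaFree(data);
    return 0;
}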

That's not entirely true. On Tegra K1, host memory and device memory are both stored in the same physical memory (DRAM), since the GPU doesn't have its own dedicated video RAM; instead it has direct access to the main DRAM. But the CPU and GPU are still two different processors with their own address spaces and their own memory and cache access methods. So depending on how you configure things or write your code, you will often still need to transfer data between being owned by the CPU and being owned by the GPU. There are different ways of doing this and some use cases are faster with traditional H->D & D->H transfers while other use cases are faster with the new managed memory API.

It mostly comes down to caching, since Tegra K1 does not have full hardware cache coherency between the CPU and GPU, and to whether your application needs to write to the memory from both the CPU and GPU or from just one of them. For now we typically recommend that people use CUDA on Tegra the same way they would on desktop (i.e., with explicit H->D and D->H transfers), but if you want to dedicate time to optimizing your code, you should try the managed memory options and possibly restructure your code pipeline to take better advantage of them.
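
To make the contrast concrete, the "desktop-style" path with explicit transfers looks roughly like this (again a simplified sketch with a placeholder kernel and sizes):

// Sketch of the traditional path: explicit H->D and D->H copies.
// On TK1 both buffers live in the same DRAM, but the copies still cost
// time because ownership of the data is handed between CPU and GPU.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // explicit H->D copy

    scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // explicit D->H copy
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}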

I've gotten ~13 GB/s with my own microbenchmark.
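
In case it's useful, a stripped-down version of that kind of copy microbenchmark (not my exact code; the buffer size and repetition count are arbitrary) would look something like:

// Hypothetical device-to-device copy microbenchmark: times repeated
// cudaMemcpyDeviceToDevice calls and counts both read and write traffic.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;                 // 64 MiB per buffer
    float *src = NULL, *dst = NULL;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up

    const int reps = 20;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads `bytes` and writes `bytes`, so count 2x per rep.
    double gbs = 2.0 * bytes * reps / (ms / 1e3) / 1e9;
    printf("Device-to-device: %.2f GB/s\n", gbs);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}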

There are different ways of doing this and some use cases are faster with traditional H->D & D->H transfers while other use cases are faster with the new managed memory API.

Are there any documents that explain these things?

Thank you.
