Question about the concepts of throughput and latency

vshaka · July 21, 2016, 2:45pm

I am a beginner reading the “CUDA C Programming Guide”.

I don’t understand the concept of throughput in the manual.

For example, section 5.3.2 contains the sentence “For example, if a 32-byte memory transaction is generated for each thread’s 4-byte access, throughput is divided by 8.”

Can someone explain throughput in detail?

Thanks a lot.

Robert_Crovella · July 21, 2016, 4:24pm

Questions like this come up frequently. There is a great deal of published information on it. You might want to study slides 30-48 in the following presentation:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

In a nutshell, DRAM subsystems on GPUs have a minimum addressable quantity, which is usually 32 bytes. If you request 32 bytes, and use 32 bytes, then that is full throughput for the memory bus: every requested byte is actually used by the program. If you request 32 bytes (the minimum) but only use 4 bytes, then 28 of the bytes transferred are wasted. Only 4/32 = 1/8 of the traffic is useful, which is where the “divided by 8” in the guide comes from.

When adjacent threads in a warp request data and that data is all adjacent in memory, the 32-byte transactions requested from DRAM are fully used by the threads in the warp. This is 100% utilization, or full throughput. If, on the other hand, each thread generates a non-adjacent address, then many more transactions are required from DRAM to satisfy each thread’s needs, a lot of “wasted” bytes are transferred, and “throughput” goes down.
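To put numbers on this, here is a rough host-side model in Python (not GPU code). It counts how many 32-byte segments one warp touches when thread i reads a 4-byte word at a given stride. The names `segments_touched` and `bus_efficiency` are made up for illustration, and the model assumes 32-byte transactions, a 32-thread warp, a base address aligned to 0, and no caching, so treat it as a sketch rather than a description of any particular GPU.

```python
SEGMENT_BYTES = 32  # assumed minimum DRAM transaction size
WARP_SIZE = 32      # threads per warp
ACCESS_BYTES = 4    # each thread reads one 4-byte word


def segments_touched(stride_words, warp_size=WARP_SIZE,
                     access_bytes=ACCESS_BYTES, segment_bytes=SEGMENT_BYTES):
    """Count the distinct segments touched when thread i reads
    access_bytes starting at byte i * stride_words * access_bytes."""
    segments = set()
    for thread in range(warp_size):
        first_byte = thread * stride_words * access_bytes
        last_byte = first_byte + access_bytes - 1
        # Count every segment the access covers, in case it straddles a boundary.
        for seg in range(first_byte // segment_bytes,
                         last_byte // segment_bytes + 1):
            segments.add(seg)
    return len(segments)


def bus_efficiency(stride_words, warp_size=WARP_SIZE,
                   access_bytes=ACCESS_BYTES, segment_bytes=SEGMENT_BYTES):
    """Fraction of transferred bytes that the warp actually uses."""
    useful = warp_size * access_bytes
    moved = segments_touched(stride_words, warp_size,
                             access_bytes, segment_bytes) * segment_bytes
    return useful / moved


if __name__ == "__main__":
    # stride 1: adjacent words; stride 8: one 32-byte segment per thread
    for stride in (1, 2, 8):
        print(stride, segments_touched(stride), bus_efficiency(stride))
```

With a stride of 1 word the warp needs 4 segments and uses every byte. With a stride of 8 words each thread lands in its own segment, so 32 segments are moved for the same 128 useful bytes, and the efficiency drops to 1/8.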

c.ping · July 21, 2016, 7:56pm

Nice resource, thanks

vshaka · July 22, 2016, 1:24am

Thanks a lot.

It’s a very clear answer. I understand throughput better now.