Global memory access bottleneck

Hi,

I am hitting a bottleneck when accessing data from global memory. I am working with a GTX 980 card. Although the documentation states that the L2 cache transaction size is 32 bytes, I am not able to get anywhere near the theoretical global memory bandwidth unless I am loading coalesced blocks of at least 256 bytes (and I am pretty sure this would also bottleneck if the memory bandwidth were high enough).

From my measurements it looks like the bottleneck is at the level of the GPU as a whole, and the throughput seems to be limited to 1 transaction (one coalesced patch of memory) per clock cycle, irrespective of how big the transaction is.
I came to this conclusion because:

  • changing the memory clock makes no difference unless it is so low that it becomes the limiting factor
  • changing the GPU clock changes the throughput almost linearly
  • increasing the number of independent load requests has no effect, so latency is not the cause of the problem
  • issuing only 8 blocks, so that only the first half of the 16 SMs (as shown by Nsight) is executing, again makes no difference (this suggests that the bottleneck is not at the SM or GPC level either)

So… can anyone tell me what the bottleneck is? Is it documented somewhere? Is there any way to somehow work around this (I do not have much hope that it is possible)?

I am providing the code so you can try it out yourself.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

#define DATA_SIZE	(1 << 30)     // size in bytes of the device buffer being read
#define DATA_ACCESSES (1 << 6)    // outer-loop iterations per thread (each issues COUNT loads)
#define BLOCK_SIZE 128
#define BLOCKS_COUNT 1024

template<int COUNT, int PAGE_SIZE, typename T>
__launch_bounds__(BLOCK_SIZE, 3)
__global__ void kernel(T *data)
{
	const int pages_count = DATA_SIZE / (PAGE_SIZE * sizeof(T));          // number of coalesced regions ("pages") in the buffer
	const int bid = (blockIdx.x * BLOCK_SIZE + threadIdx.x) / PAGE_SIZE;  // index of the page sequence this thread follows
	const int tid = (blockIdx.x * BLOCK_SIZE + threadIdx.x) % PAGE_SIZE;  // this thread's element offset within each page
	unsigned int dummy[COUNT];
	for (int c = 0; c < COUNT; c++)
	{
		dummy[c] = 0;
	}
	for (int i = 0; i < DATA_ACCESSES; i++)
	{
		#pragma unroll
		for (int c = 0; c < COUNT; c++)
		{
			unsigned int page = ((bid * COUNT + c) * DATA_ACCESSES + i) * 1031;   // pseudo-random page number (1031 is prime)
			unsigned int index = (page % pages_count) * PAGE_SIZE + tid;          // element index of this thread within that page
			T v = data[index];
			unsigned int * va = reinterpret_cast<unsigned int *>(&v);
			#pragma unroll
			for (int j = 0; j < (sizeof(T) / sizeof(unsigned int)); j++)
			{
				dummy[c] ^= va[j];
			}
		}
	}
	if (bid != 1 << 24) return;   // always true for this grid: the store below never executes, but the loads cannot be optimized away
	for (unsigned int c = 0; c < COUNT; c++)
	{
		reinterpret_cast<unsigned int *>(data)[COUNT * c + tid] = dummy[c];
	}
}

int main()
{
	uint4 * data;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaMalloc(reinterpret_cast<void**>(&data), DATA_SIZE);
	const int count = 32;     // independent loads per thread per outer iteration
	const int page_size = 8;  // coalesced region = page_size * sizeof(uint4) bytes (here 128)
	kernel<count, page_size> << <BLOCKS_COUNT, BLOCK_SIZE >> >(data);   // warm-up launch
	cudaDeviceSynchronize();
	cudaEventRecord(start, nullptr);
	kernel<count, page_size> << <BLOCKS_COUNT, BLOCK_SIZE >> >(data);   // timed launch
	cudaEventRecord(stop, nullptr);
	cudaDeviceSynchronize();
	float duration;
	cudaEventElapsedTime(&duration, start, stop);
	printf("%f ms\n", duration);
	printf("%f GB/s\n", (1000.0f/duration) * count * sizeof(uint4) * DATA_ACCESSES * BLOCKS_COUNT * BLOCK_SIZE / static_cast<float>(1 << 30));
	return 0;
}

You can change the memory block size by changing the page_size variable (block size in bytes = page_size * sizeof(uint4)). You can also change the element type to uint2 or just uint and increase page_size accordingly; it makes no difference.
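For example, a minimal sketch of what such a change looks like (the reinterpret_cast assumes the uint4 allocation from main() is reused; note the GB/s printout in main() hard-codes sizeof(uint4) and would need sizeof(T) there to stay correct):

// Same 128-byte coalesced regions, expressed with uint2 elements instead of uint4.
const int count = 32;
const int page_size = 16;   // 16 * sizeof(uint2) = 128 bytes per region
kernel<count, page_size> << <BLOCKS_COUNT, BLOCK_SIZE >> >(reinterpret_cast<uint2 *>(data));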

Your findings appear correct. As a rough rule of thumb, on most cards you need to have two 128-byte transactions in flight per warp to reach about 80% of the theoretical global memory bandwidth. This is to cover the latency of a fully loaded memory system, which can be higher than that of an otherwise idle GPU (more like 2000 cycles rather than the 400-800 cycles the Programming Guide used to mention - sorry, I don't have current numbers at hand from either the documentation or newer devices).
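To illustrate what that rule of thumb means in code, here is a minimal sketch (kernel and variable names are made up for this example, not taken from the posts above): each thread issues two independent loads before either value is used, so every warp keeps two 128-byte transactions in flight.

// Grid-stride copy with two independent 128-byte transactions in flight per warp:
// each thread loads two floats from unrelated locations before either is consumed
// (32 threads x 4 bytes = one 128-byte transaction per warp-wide load).
__global__ void copy2(float *dst, const float *src, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = gridDim.x * blockDim.x;
	for (; i + stride < n; i += 2 * stride)
	{
		float a = src[i];            // first transaction
		float b = src[i + stride];   // second, independent transaction
		dst[i] = a;
		dst[i + stride] = b;
	}
	if (i < n) dst[i] = src[i];      // tail element, if any
}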

tera, thank you for your comment. I agree that everything you said applies in the general case… however, in this particular case latency is not the problem.
I am already issuing way more loads than should be needed to cover any latency, to the point where I should be achieving the full practical global memory bandwidth (which is usually a little more than 90% of the theoretical bandwidth).
What is actually happening is that as I launch more warps in parallel, the performance actually decreases… the GPU seems to be choking as I increase the number of transactions in flight.

Concretely, if I use the code above and make these changes (smaller blocks so I can increase BLOCKS_COUNT with better granularity):

#define DATA_ACCESSES (1 << 10)
#define BLOCK_SIZE 32
...
__launch_bounds__(BLOCK_SIZE, 12)

and then vary BLOCKS_COUNT in steps of 16 (the number of SMs on a GM204 GPU), I get the following results:

Warps/SM Throughput
1 93.168999 GB/s
2 133.896057 GB/s
3 147.648956 GB/s
4 144.692673 GB/s
5 144.206985 GB/s
6 143.908417 GB/s
7 143.298279 GB/s
8 142.075485 GB/s
9 145.146011 GB/s
10 144.437073 GB/s
11 143.424698 GB/s
12 140.358185 GB/s

As you can see, the throughput peaks at 3 warps per SM (each warp issues 32 independent loads, and each load touches 4 x 128 bytes).

Now if I change the size of the coalesced region from 128 bytes to 256 bytes (by changing the variable page_size from 8 to 16), we get these results:

Warps/SM Throughput
1 99.104408 GB/s
2 157.901657 GB/s
3 181.652252 GB/s
4 188.924500 GB/s
5 189.297562 GB/s
6 190.258759 GB/s
7 190.666367 GB/s
8 191.455856 GB/s
9 191.367142 GB/s
10 191.371674 GB/s
11 191.544296 GB/s
12 191.698975 GB/s

Now this looks the way it should when the limiting factor is latency at first and then the global memory bandwidth. My memory is running at 3505 MHz over a 256-bit-wide bus, which translates to 219 GB/s of theoretical bandwidth. So in this case we are reaching almost 90% (the actual bandwidth is even a little higher, as the timing is not that precise and includes all the overhead of starting a kernel).

So you see… the only difference between these two cases is how big the coalesced patch is; everything else is the same, and we are issuing the same number of independent loads which load the same overall amount of data. And we also see that in the first case the bottleneck isn't the memory bandwidth, and it also isn't latency.

The questions are:
What is the bottleneck?
Is it documented anywhere?
Is there some workaround?

Achieving 90% of theoretical bandwidth is an extremely good result; I do not recall seeing more than about 85% in actual applications on GPUs without ECC.

What you are observing are likely artifacts of the interaction between internal buffering, re-ordering, scheduling, and coalescing mechanisms in the memory controller, which change from GPU generation to GPU generation. I am not aware of any detailed documentation on the internal mechanisms of Maxwell-class GPU memory controllers, either from NVIDIA or from third parties who have reverse-engineered them.

As for a workaround, it seems you have already discovered it: Use coalesced regions of 256 bytes. You may also want to look at the impact of the total number of threads running on memory throughput. In a previous GPU generation, maximizing memory throughput required up to 20x “oversubscription” of each SM’s warp execution resources. That is, the total number of thread blocks in the launched grid was ideally 20 times the number of thread blocks able to run concurrently. I do not know whether this heuristic still applies to Maxwell.
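If you want to experiment with that, a rough sketch of sizing the grid along those lines could look like the following (purely illustrative; it reuses the kernel from the code above, and the 20x factor is just the old heuristic mentioned here, not a documented constant):

// Sketch: oversubscribe each SM's concurrent block slots by ~20x (old heuristic, not a documented rule).
int device = 0, num_sms = 0, blocks_per_sm = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
	kernel<count, page_size, uint4>, BLOCK_SIZE, 0);
int concurrent_blocks = blocks_per_sm * num_sms;   // blocks that can actually run at the same time
int grid_size = 20 * concurrent_blocks;            // total blocks in the launched grid
kernel<count, page_size> << <grid_size, BLOCK_SIZE >> >(data);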

I was able to get 90.97% out of the GTX 980M for a sum reduction with a power-of-2 size:

GeForce GTX 980M @ 160.320 GB/s

 N            [GB/s]   [perc]   [usec]   test
 1048576       96.79   60.37      43.3   Pass
 2097152      112.60   70.24      74.5   Pass
 4194304      128.75   80.31     130.3   Pass
 8388608      136.75   85.30     245.4   Pass
 16777216     141.31   88.14     474.9   Pass
 33554432     143.84   89.72     933.1   Pass
 67108864     144.86   90.36    1853.1
 134217728    145.84   90.97    3681.3

 Non-base 2 tests!

 N            [GB/s]   [perc]   [usec]   test
 14680102     140.85   87.85    416.9    Pass
 14680119     140.79   87.82    417.1    Pass
 18875600     140.11   87.39    538.9    Pass
 7434886       99.97   62.35    297.5    Pass
 13324075     132.60   82.71    401.9    Pass
 15764213     136.11   84.90    463.3    Pass
 1850154       47.10   29.38    157.1    Pass
 4991241       93.11   58.08    214.4    Pass

I guess that tells us that the memory controller of Maxwell-class GPUs is more efficient than the memory controllers of previous GPU generations :-)

Thank you guys for your comments.
@njuffa regarding the workaround… that is precisely the one thing that I cannot change… the algorithm has to work with regions of 128 bytes, and there is no way to coalesce the accesses, as the regions are more or less randomly scattered in a rather large block of memory (more than 1 GB).

It is unfortunate that this bottleneck is not properly documented; it could have saved me quite a lot of time that I wasted trying to achieve higher throughput… now it is obvious that every attempt at that was doomed to fail :/

Oh well… on to other challenges :)

Thanks guys

The transactions in flight do not need to be adjacent. Try prefetching the next 128-byte block while you are processing the previous one. This might need checking with “cuobjdump -sass” and some fiddling and twiddling to ensure the compiler orders these as intended. I used to have a bit of trouble with the compiler placing a register-to-register move directly after the load (which defeats the purpose of prefetching), so check that's not the case for your code.
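The structure would be roughly this (a sketch only, with made-up names; process() stands in for whatever the real algorithm does with a 128-byte region, and whether the load really stays that far ahead of its use has to be verified in the SASS):

__device__ float process(float v) { return v * v; }     // placeholder for the real per-region work

// Prefetch sketch: load the next 128-byte region while still working on the current one,
// so two independent transactions per warp are in flight. Assumes a block size of 32 (one warp)
// and that region_offsets holds the scattered region start indices (in floats).
__global__ void consume(const float *data, const int *region_offsets, int regions, float *out)
{
	int lane = threadIdx.x;                              // 0..31, one float per lane = 128 bytes per warp
	float cur = data[region_offsets[0] + lane];          // first region
	float acc = 0.0f;
	for (int r = 1; r < regions; r++)
	{
		float next = data[region_offsets[r] + lane];     // prefetch the next region...
		acc += process(cur);                             // ...while processing the current one
		cur = next;
	}
	acc += process(cur);                                 // last region
	out[blockIdx.x * 32 + lane] = acc;
}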

I do not think we can assume that there is a specific bottleneck somewhere in the Maxwell memory controller’s design. When there are multiple mechanisms interacting there could be any number of scenarios based on specific address patterns presented to the controller, and as tera points out, that is also a function of the machine code generated by the compiler, which at minimum schedules but may also re-order memory operations.

I suspect it is a bit like internet congestion, which is likewise a collection of scheduling, buffering, and protocol mechanisms, often tuneable: it is very difficult to get a handle on the overall system behavior even for people familiar with all the details, and non-intuitive results, such as additional buffering lowering throughput, easily occur.

So even given a detailed description of the memory controller it is doubtful whether it would be clear what the optimal strategy is. I used to work on, and also optimize software for, x86 CPUs, and the last time any optimization strategy could be completely devised on paper was probably the Pentium MMX (P55C) and the AMD K6-2, assuming a solid understanding of the CPU, chipset, and memory used at the time. Ever since then, optimization has been an experimental process that may well be informed by detailed device documentation, but cannot simply rely on it.

As for the efficiency of memory throughput on Maxwell, I see numbers in line with my previously stated rule of thumb. My Quadro K2200 (sm_50) has a theoretical throughput of 80.16 GB/sec. The measured throughputs of DCOPY and ZCOPY are only about 80% of that:

> dcopy -n16777216
dcopy: operating on vectors of 16777216 doubles (= 1.342e+008 bytes)
dcopy: using 128 threads per block, 65520 blocks
dcopy: mintime = 4.167 msec  throughput = 64.42 GB/sec

> zcopy -n16777216
zcopy: operating on vectors of 16777216 double2s (= 2.684e+008 bytes)
zcopy: using 128 threads per block, 65520 blocks
zcopy: mintime = 8.313 msec  throughput = 64.58 GB/sec

I guess the difference from CudaaduC's test is that *COPY requires equal amounts of reads and writes, and is therefore exposed to the DRAM's read-write turnaround, while the sum reduction only performs loads?
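For context, the kind of kernel such a *COPY benchmark times is essentially a plain streaming copy, along these lines (a sketch, not the actual dcopy/zcopy code): every byte read is also written, which is what exposes the read-write turnaround.

// Sketch of a dcopy-style streaming copy of doubles (grid-stride loop).
// Reads and writes are balanced 1:1, unlike a sum reduction, which streams essentially only reads.
__global__ void dcopy_kernel(double *dst, const double *src, size_t n)
{
	for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
	     i += (size_t)gridDim.x * blockDim.x)
	{
		dst[i] = src[i];
	}
}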