Question about kernel granularity

Hello Everyone,

I searched the forum for this question but didn’t find it. If it has been asked before, I apologize for re-asking.

I am trying to understand the effect of the number of blocks (grid size) in a very small saxpy example. I benchmarked with the same data size and the same threads per block but different grid sizes, and I saw that performance differs only slightly as long as I don’t make the grid quite small. Here are some of my results.

I used 125829120 as the data size and 192 threads per block, on a K80 GPU.

I can understand why the 1st test is slow, because the grid (number of blocks) is small. But I don’t understand why the 3rd case is better than the 2nd case.

  1. <<<102,192>>>   -----> 32.954182ms
  2. <<<256,192>>>   -----> 9.980893ms
  3. <<<65536,192>>> -----> 9.275939ms

__global__ void saxpy(int n, float a, float *x, float *y) {
  for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    y[i] = a * x[i] + y[i];
}

I’m not able to reproduce the large difference between your 102 case and your 256 case; I suspect a benchmarking methodology issue there (no warmup, perhaps?).

Regarding the other difference(s), I suspect that the block configuration leads to slightly different L2 cache efficiencies, resulting in slightly different overall execution times, for this memory-bound code. Here is my full test case:

$ cat t913.cu
#include <stdio.h>
#include <cuda_profiler_api.h>

__global__ void saxpy(int n, float a, float *x, float *y) {
  for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    y[i] = a * x[i] + y[i];
}

int main(){

  const int dsize = 125829120;
  const int ds = dsize*sizeof(float);
  const float my_a = 1.0f;
  float *d_x, *d_y;
  float et;
  int nblk = 102;
  cudaMalloc(&d_x, ds);
  cudaMalloc(&d_y, ds);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  saxpy<<<nblk,192>>>(dsize, my_a, d_x, d_y); // warm up
  cudaProfilerStart();
  cudaEventRecord(start);
  saxpy<<<nblk,192>>>(dsize, my_a, d_x, d_y); // timed launch
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("case %d time: %fms\n", nblk, et);
  nblk = 256;
  cudaEventRecord(start);
  saxpy<<<nblk,192>>>(dsize, my_a, d_x, d_y); // timed launch
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  printf("case %d time: %fms\n", nblk, et);
  nblk = 65536;
  cudaEventRecord(start);
  saxpy<<<nblk,192>>>(dsize, my_a, d_x, d_y); // timed launch
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  cudaProfilerStop();
  printf("case %d time: %fms\n", nblk, et);
  cudaError_t res = cudaGetLastError();
  if (cudaSuccess != res) printf("cuda error: %s\n", cudaGetErrorString(res));
  return 0;
}
$ nvcc -arch=sm_37 -o t913 t913.cu
$ ./t913
case 102 time: 11.679136ms
case 256 time: 11.760864ms
case 65536 time: 10.948704ms
$ nvprof --profile-from-start off  --metrics l2_read_throughput ./t913
==6364== NVPROF is profiling process 6364, command: ./t913
==6364== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==6364== Replaying kernel "saxpy(int, float, float*, float*)" (done)
case 102 time: 130.918945ms
==6364== Replaying kernel "saxpy(int, float, float*, float*)" (done)
case 256 time: 121.372002ms
==6364== Replaying kernel "saxpy(int, float, float*, float*)" (done)
case 65536 time: 119.577789ms
==6364== Profiling application: ./t913
==6364== Profiling result:
==6364== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K80 (0)"
    Kernel: saxpy(int, float, float*, float*)
          3                        l2_read_throughput                     L2 Throughput (Reads)  79.940GB/s  85.619GB/s  81.845GB/s
$

Across the 3 profiled kernels, the l2_read_throughput differs substantially (79.9 to 85.6 GB/s). This could give rise to differences in overall timing.

measured read tput ratio: 85.6/79.9 = 1.07
measured krnl time ratio: 11.7/10.9 = 1.07
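
As a rough sanity check that this is indeed memory bound (back-of-the-envelope only; it assumes 2 reads and 1 write of 4 bytes per element, and the ~240 GB/s figure is the nominal peak of one GK210 on a K80, lower with ECC enabled):

#include <stdio.h>

int main(){
  // rough DRAM-traffic estimate for the ~11 ms saxpy timings above
  const double n     = 125829120.0;        // elements
  const double bytes = n * 3.0 * 4.0;      // read x, read y, write y
  const double t_s   = 11.0e-3;            // approximate measured kernel time
  printf("~%.0f GB/s of DRAM traffic\n", bytes / t_s / 1e9);  // ~137 GB/s
  // vs. ~240 GB/s nominal peak per GK210: the kernel is limited by memory
  // throughput, not arithmetic, so L2 behavior matters
  return 0;
}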

Thank you for the answer, txbob.

Before doing this test, I would have expected a small grid size to yield better performance, since allocating fewer thread blocks is cheap. I made a dummy-kernel test (code below) to see the cost of the kernel configuration. The results are reasonable: when more thread blocks have to be launched, the kernel takes longer. But even though allocating a smaller grid is cheaper, how can a bigger grid size yield better performance?

Checking the L2 cache efficiency is a good idea, but why does it matter for this application? I think the data size is aligned with the thread count, so the application should not need any L2 cache.

__global__ void dummy(int n, float a, float *x, float *y) {
  // empty body: no memory is read or written, so this mostly measures the
  // cost of launching and scheduling the thread blocks
  for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x) {
  }
}
$ ./saxpy
[dummy] case 102 time: 0.614336ms
[dummy] case 256 time: 0.666688ms
[dummy] case 65536 time: 0.786048ms

[saxpy] case 102 time: 11.698720ms
[saxpy] case 256 time: 11.741568ms
[saxpy] case 65536 time: 10.964160ms

That seems like a reversal of your previous position (in your first post you said you could understand the small-grid case being slow precisely because the grid is small).

I don’t understand this statement at all:

I think the data size is aligned with the thread count, so the application should not need any L2 cache.

All data traffic to main memory (global device memory) flows through the L2 cache. If the L2 cache reports a higher or lower efficiency (e.g. read throughput) that will most certainly show up in a memory bound code as an effect on performance. Rather than trying to come up with reasons why we should discount the profiler data, I’d rather use the profiler data to inform possible theories that might explain it.

So the profiler output is not a complete answer, but I believe it is a useful starting point to develop hypotheses.

I don’t have a solidly tested theory, but I would start with an additional assumption that the L2 cache is not fully associative. (e.g. perhaps it is 4-way set associative, or 8-way set associative, or something like that). I do not know this to be true, but I think microbenchmarking can support or disprove the assumption, and in my experience fully associative cache designs are rare. In addition, we know that covering a very large address space (large data set) will involve the TLB. The memory footprint for your test is ~1GB, which is probably on the “edge” of where TLB inefficiency may come into play, for strided access patterns over a large data set.
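
If someone wanted to test that assumption, here is a rough sketch of the kind of microbenchmark I have in mind (untested; the kernel name, launch configuration and sweep range are arbitrary choices, and it assumes a simple power-of-two set mapping, which the hardware may well hash away): issue a fixed number of reads per launch while sweeping the stride, and look for throughput cliffs at power-of-two strides (limited associativity) or at very large strides (TLB reach).

#include <stdio.h>

// Each thread performs a fixed number of strided reads, wrapping around the
// buffer, so every launch issues the same number of loads regardless of
// stride.  Sharp slowdowns at particular power-of-two strides would hint at
// limited L2 set associativity; gradual slowdown at very large strides would
// hint at TLB reach.
__global__ void stride_probe(const float * __restrict__ x, size_t n,
                             size_t stride, int iters, float *sink) {
  size_t nthreads = (size_t)gridDim.x * blockDim.x;
  size_t idx = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * stride % n;
  float s = 0.0f;
  for (int k = 0; k < iters; ++k) {
    s += x[idx];
    idx = (idx + nthreads * stride) % n;
  }
  if (s == -1.0f) *sink = s;  // never true; keeps the loads from being optimized away
}

int main() {
  const size_t n = 256u * 1024 * 1024;       // 1 GB of floats
  const int iters = 256;
  float *x, *sink;
  cudaMalloc(&x, n * sizeof(float));
  cudaMalloc(&sink, sizeof(float));
  cudaMemset(x, 0, n * sizeof(float));
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  for (size_t stride = 1; stride <= (1 << 20); stride *= 2) {
    stride_probe<<<256, 192>>>(x, n, stride, iters, sink);   // warm up
    cudaEventRecord(start);
    stride_probe<<<256, 192>>>(x, n, stride, iters, sink);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("stride %8zu floats: %f ms\n", stride, ms);
  }
  return 0;
}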

So what are the actual access patterns, and do they differ for the different grid size configurations?

Each thread is following a grid-striding loop. For the largest grid, the grid width is 65536 * 192 = 12582912 threads, so each thread in the grid will stride 10 times, taking hops that are 1/10 of the data set size. For the smallest grid, each thread will take approximately 6400 hops. Furthermore, we must consider that warps across the grid will inevitably get widely out of sync with each other, and also that threadblocks will retire and allow new threadblocks to start in the largest grid case, but maybe not so much, or at all, in the smallest grid case. In the smallest grid case, a K80 GPU with 13 SMs and its extra-large register file might be able to support 8 of your threadblocks at once per SM (8 * 192 = 1536 < 2048), which would account for the entire grid being resident for most of the duration of the kernel execution (8 * 13 = 104, which is greater than 102).
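
Writing that arithmetic out explicitly (the 8-blocks-per-SM figure is just my occupancy estimate, not a measured value):

#include <stdio.h>

int main() {
  const long long n = 125829120;                        // elements
  const int tpb = 192;                                  // threads per block

  long long big   = 65536LL * tpb;                      // 12,582,912 threads
  long long small = 102LL * tpb;                        //     19,584 threads
  printf("largest grid:  %lld threads, %lld trips of the grid-stride loop\n",
         big, (n + big - 1) / big);                     // 10 trips
  printf("smallest grid: %lld threads, %lld trips of the grid-stride loop\n",
         small, (n + small - 1) / small);               // 6426 trips (~6400)

  // residency on a 13-SM GK210, assuming ~8 blocks of 192 threads per SM
  printf("8 * 192 = %d threads/SM (limit 2048)\n", 8 * 192);   // 1536
  printf("8 * 13  = %d resident blocks (>= 102)\n", 8 * 13);   // 104
  return 0;
}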

My conjecture is that when you roll this chaotic access pattern together, the larger grid results in more efficient use of the L2 cache, whether due to cache set associativity, TLB patterns, or both. Again, I have not connected all the dots, but this theory is at least consistent with the profiler data. (Normally I wouldn’t expect TLB patterns to have much of an effect for a ~1GB footprint, but you are asking for an explanation of a 7% performance difference here, so I wouldn’t rule it out as a possible small contributor.)

It may seem counterintuitive, but I suspect that the block retirement that occurs regularly with the larger grid size (coupled with a coarse grid-striding loop) may actually lead to a more organized access pattern on average than the case where the entire grid is resident on the GPU, where the pattern could become almost completely chaotic.

Or feel free to advance your own theory. If your theory discounts the L2 profiler data which lines up nicely with the observed perf difference, I will be skeptical.

Regarding your “dummy” case, I discount that data. It’s not reflective of what to expect in a memory bound code. The GPU is a latency hiding machine, and this includes nearly all machine latencies you can imagine, even the latency associated with launching a large number of blocks. Your dummy case is not memory bound. Your saxpy case is. Even if the differences evident in the dummy case manifested in the saxpy case, it’s on the order of 0.2ms whereas the 7% difference you’re chasing is on the order of 0.7ms, so something else must be at work, even if the threadblock launch latency is a factor.
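
To put numbers on that, using the timings you posted:

#include <stdio.h>

int main() {
  // launch-overhead spread seen in the dummy kernel (65536 vs 102 blocks)
  printf("dummy spread: %.3f ms\n", 0.786048 - 0.614336);   // ~0.17 ms
  // saxpy gap being investigated (256 vs 65536 blocks)
  printf("saxpy gap:    %.3f ms\n", 11.741568 - 10.964160); // ~0.78 ms
  // note the signs point in opposite directions: the large grid pays more
  // launch overhead yet finishes faster, so launch cost cannot explain the gap
  return 0;
}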

Thank you very much for the long reply. It really helps me understand what could be happening on the GPU.

It is a fact that the L2 cache behaviour somehow changes when I change the grid size. TLB effects or a 4-way set-associative cache could be the reason, as you say. I agree with that.

In my previous message I was trying to say that there is also a cost to allocating thread blocks, and this cost must be higher when I create a bigger grid. However, apparently cache efficiency has a stronger influence than thread-block allocation.

I would start with an additional assumption that the L2 cache is not fully associative. (e.g. perhaps it is 4-way set associative, or 8-way set associative, or something like that). I do not know this to be true

I haven’t seen fully associative caches in CPUs since the 32-entry TLB of the 80386.