Shared memory bank size mode: 4-byte vs. 8-byte on Kepler

Hi. I don’t understand the difference between these two modes. I wrote a simple kernel that uses a shared memory array of 32 floats and reads every location many times. I ran it in both modes and then profiled it. The only thing that changes is the bank size, 4 bytes to 8 bytes; nothing else changes: same transaction count, same throughput, same efficiency. I repeated the experiment with 32 doubles, and again nothing changed. So what is the difference? Can someone explain it to me with example code? Thanks a lot. I use a Jetson TK1 for my experiments.

#define NUM_THREAD 32   // block size (assumed; the original post does not show it)

__global__ void sharedmemory_kernel() {

    __shared__ float b[NUM_THREAD*2];

    b[threadIdx.x] = (float)threadIdx.x;

    __syncthreads();

    // Read the same location repeatedly. Note that since 'a' is never
    // used, the compiler is free to optimize these loads away entirely.
#pragma unroll
    for (int i = 0; i < 10; i++) {
        float a = b[threadIdx.x];
        (void)a;   // silence the unused-variable warning
    }
}
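
For reference, I switch between the two modes with the runtime API before launching the kernel; a minimal sketch of the host side (error checking omitted):

// Select the 8-byte bank mode device-wide; use
// cudaSharedMemBankSizeFourByte for the 4-byte mode.
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

// Optionally confirm which mode is actually in effect.
cudaSharedMemConfig cfg;
cudaDeviceGetSharedMemConfig(&cfg);
printf("bank mode: %s\n",
       cfg == cudaSharedMemBankSizeEightByte ? "8-byte" : "4-byte");

sharedmemory_kernel<<<1, NUM_THREAD>>>();
cudaDeviceSynchronize();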

I have never encountered any significant performance differences when switching between 4-byte and 8-byte bank modes, and the handful of people who I discussed the issue with had likewise not encountered any significant differences. I would suggest ignoring this possible configuration knob and simply going with the default. If I am not mistaken, Maxwell discontinued configurable bank width, which may be another indication that this wasn’t a particularly useful feature of Kepler.

Pretty much every new generation of GPUs introduces some new features that turn out to be less effective or useful than anticipated and are thus discontinued in the following generation(s). That GPU architecture hasn’t solidified in the past decade to the point that CPUs reached after four decades of development is to be expected. GPUs are still cutting-edge technology with much room for (micro-)architectural exploration and innovation, which I would expect to continue after silicon process technology driven Moore’s Law comes to its end in the next few years.

I agree with njuffa that there will frequently be no difference between the two cases.

I believe it’s possible to demonstrate a difference, however. Whether it’s significant or not, I don’t know; it would depend on the real test case, not these made-up examples.

Modifying code from here:

http://stackoverflow.com/questions/29933976/cuda-shared-memory-bank-conflicts-report-higher

I propose the following test case:

$ cat t750.cu
#include <stdio.h>

typedef double mytype;

template <typename T>
__global__ void conflict() {
    __shared__ T values[33];
    values[threadIdx.x] = threadIdx.x;
    values[threadIdx.x+1] = threadIdx.x;
}

int main(){

#ifdef EBM
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
#endif

  conflict<mytype><<<1,32>>>();
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_35 -o t750 t750.cu
t750.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(19): here

$ nvprof --metrics shared_replay_overhead ./t750
==46560== NVPROF is profiling process 46560, command: ./t750
==46560== Profiling application: ./t750
==46560== Profiling result:
==46560== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
 Kernel: void conflict<double>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.142857    0.142857    0.142857
$ nvcc -arch=sm_35 -DEBM -o t750 t750.cu
t750.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(19): here

$ nvprof --metrics shared_replay_overhead ./t750
==46609== NVPROF is profiling process 46609, command: ./t750
==46609== Profiling application: ./t750
==46609== Profiling result:
==46609== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
 Kernel: void conflict<double>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.000000    0.000000    0.000000
$

The compiler doesn’t seem to be optimizing away the necessary code to show the difference, in this case. I have not tested the behavior on a cc5.x device, but your TK1 is a cc3.2 device, so I would expect it to behave as above.
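
To see why the double case conflicts in 4-byte mode but not in 8-byte mode, it helps to compute which bank each thread’s store starts in. Here is a host-side sketch under a simplified model (bank = (byte_offset / bank_width) mod 32; the actual hardware has more nuance, but the pattern matches the measurements above):

#include <stdio.h>

// Simplified bank model for the t750 store values[threadIdx.x+1]
// with 8-byte elements (double). Counts how many threads of a
// 32-thread warp start their access in each bank.
int main(void) {
    for (int bank_width = 4; bank_width <= 8; bank_width += 4) {
        int count[32] = {0};
        for (int t = 0; t < 32; t++) {
            size_t byte_offset = (size_t)(t + 1) * sizeof(double);
            int bank = (int)((byte_offset / bank_width) % 32);
            count[bank]++;
        }
        int worst = 0;
        for (int b = 0; b < 32; b++)
            if (count[b] > worst) worst = count[b];
        printf("%d-byte banks: up to %d thread(s) per bank\n",
               bank_width, worst);
    }
    return 0;
}

With 4-byte banks, threads t and t+16 start in the same bank (a 2-way conflict, hence the non-zero replay overhead); with 8-byte banks, every thread lands in a distinct bank, matching the zero overhead under -DEBM.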

The above case uses double, but we can also construct a rather contrived float test case demonstrating the reverse: that the cudaSharedMemBankSizeEightByte mode is harmful:

$ cat t751.cu
#include <stdio.h>

typedef float mytype;

template <typename T>
__global__ void conflict() {
    __shared__ T values[1025];
    values[(threadIdx.x*33)+1] = threadIdx.x;
}

int main(){

#ifdef EBM
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
#endif

  conflict<mytype><<<1,32>>>();
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_35 -DEBM -o t751 t751.cu
t751.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(18): here

$ nvprof --metrics shared_replay_overhead ./t751
==47017== NVPROF is profiling process 47017, command: ./t751
==47017== Profiling application: ./t751
==47017== Profiling result:
==47017== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
 Kernel: void conflict<float>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.125000    0.125000    0.125000
$ nvcc -arch=sm_35 -o t751 t751.cu
t751.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(18): here

$ nvprof --metrics shared_replay_overhead ./t751
==47068== NVPROF is profiling process 47068, command: ./t751
==47068== Profiling application: ./t751
==47068== Profiling result:
==47068== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
 Kernel: void conflict<float>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.000000    0.000000    0.000000
$
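
Under the same simplified model, the float store values[(threadIdx.x*33)+1] sits at byte offset 4*(33*t+1) for thread t. With 4-byte banks that is bank (33*t+1) mod 32 = (t+1) mod 32, distinct for every thread in the warp, hence no conflict. With 8-byte banks the bank is ((33*t+1)/2) mod 32 (integer division), and threads 0 and 31 both land in bank 0 at different 8-byte words, a 2-way conflict, consistent with the non-zero replay overhead measured under -DEBM.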

Well, how do you explain this situation? I tested the first code you posted. When the mode is 4-byte, there is a conflict; when the mode is 8-byte, there isn’t. But it looks similar to a race condition, because if I put a __synchronize() between the two memory accesses, there are no conflicts in either mode.

I am doing a study on shared memory conflicts, in particular how much power (in watts) is consumed when a conflict occurs. But I don’t understand how these modes work.

I don’t know what __synchronize() is. That doesn’t appear to be any part of CUDA.

I added __syncthreads() in between the two lines of code; it didn’t seem to make much difference:

$ cat t766.cu
#include <stdio.h>

typedef double mytype;

template <typename T>
__global__ void conflict() {
    __shared__ T values[33];
    values[threadIdx.x] = threadIdx.x;
    __syncthreads();
    values[threadIdx.x+1] = threadIdx.x;
}

int main(){

#ifdef EBM
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
#endif

  conflict<mytype><<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t766 t766.cu
t766.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(20): here

$ nvprof --metrics shared_replay_overhead ./t766
==1968== NVPROF is profiling process 1968, command: ./t766
==1968== Profiling application: ./t766
==1968== Profiling result:
==1968== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GT 640 (0)"
        Kernel: void conflict<double>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.125000    0.125000    0.125000
$ nvcc -arch=sm_35 -DEBM -o t766 t766.cu
t766.cu(8): warning: variable "values" was set but never used
          detected during instantiation of "void conflict<T>() [with T=mytype]"
(20): here

$ nvprof --metrics shared_replay_overhead ./t766
==2024== NVPROF is profiling process 2024, command: ./t766
==2024== Profiling application: ./t766
==2024== Profiling result:
==2024== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GT 640 (0)"
        Kernel: void conflict<double>(void)
          1                    shared_replay_overhead             Shared Memory Replay Overhead    0.000000    0.000000    0.000000
$

The “Shared Memory Replay Overhead” is still non-zero, unless I define EBM, in which case it is zero. In fact, the only kernel line of code that is relevant for this example is the last one. These two lines can be commented out:

values[threadIdx.x] = threadIdx.x;
__syncthreads();

and the “Shared Memory Replay Overhead” observation (zero vs. non-zero) is the same.

Based on my understanding of bank conflicts, I don’t think it would ever be logical to conclude that bank conflicts could be affected by the use of __syncthreads().

Bank conflicts arise from the shared memory addresses emanating from a single load (or store) instruction in a single warp. There is no concept of bank conflicts arising from 2 instructions in the same warp, or 2 instructions from separate warps.
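
As a minimal illustration (a hypothetical sketch, not code from earlier in this thread): in the default 4-byte mode, a stride-2 float access makes threads t and t+16 of a warp hit the same bank within one load instruction, which is the only way a bank conflict can arise:

// Hypothetical sketch: a 2-way bank conflict from a single load.
// With 4-byte banks, s[2*t] lives in bank (2*t) % 32, so threads
// t and t+16 of the warp collide on that one instruction.
__global__ void stride2(float *out) {
    __shared__ float s[64];
    s[threadIdx.x]      = (float)threadIdx.x;  // one bank per thread: no conflict
    s[threadIdx.x + 32] = 0.0f;
    __syncthreads();
    out[threadIdx.x] = s[2 * threadIdx.x];     // 2-way conflict on this load
}

Launched as stride2<<<1,32>>>(out), the conflict comes entirely from that final load; the __syncthreads() before it cannot remove it.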

Sorry, I meant __syncthreads() :). Hmm. I use the Visual Profiler to collect the metrics, and I saw this difference.

Well, I tried again and now it works like yours -.-’

Now I am trying to understand the differences between the two modes.

@txbob In your opinion, is it possible that after a conflict the GPU does some automatic realignment? I printed out all of shared memory and it is obviously all coherent, and the access addresses are correct. So the access addresses are the same and the memory is the same… why does one mode produce a conflict and the other not?

I’m not sure where to start. If you need a tutorial on bank conflicts, I’m not sure I have the time to write it up, but you can certainly read up on it from various resources, such as by googling “gtc cuda bank conflicts”.

Thanks a lot for everything.