Details of Global and L2 cache configuration in Tesla K40

Hi

I am curious about some configuration details of the global memory and L2 cache of the Tesla K40. I searched for answers to a few questions but couldn’t find much information, so I thought I would ask here. My questions are listed below:

  1. How many banks are present in the global memory of the K40?
  2. I saw that the global memory bus width is 384 bits, but what is the width of each bank (that is, beyond
    what width would the data be allocated in the next bank)?
  3. How is the whole global memory divided among the banks; I mean, what is the size of each bank?
  4. I searched for a microbenchmark but couldn’t find one. Is there a microbenchmark available that could
    help me understand these details?
  5. How is the L2 connected to each of the global memory banks?
    According to this link [url]http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Discloses-Full-Memory-Structure-and-Limitations-GTX-970[/url], on Maxwell the L2 banks match the global memory banks, with each L2 bank connected to a
    memory bank. Is this the same for the K40 as well? If so, the same questions apply to L2: is there a microbenchmark to find those details for L2?

Any help would be much appreciated. Thank you.

The answer is right in front of you!

You can think of the 8 banks of memory as 8 independent memory channels just like on Intel CPUs.

The address space is divided round robin across the banks/channels:

word0 → bank 0
word1 → bank 1

word7 → bank 7
word8 → bank 0
word9 → bank 1

The article says the stride to span all 8 banks is 1KiB, which means each word is 128 bytes (1024 / 8). However, each bank is only 4 bytes wide, so fetching 128 bytes takes a burst of 32 transfers (128 / 4). With bandwidth always increasing while latency stays constant, larger and larger block sizes are needed to utilize the available bandwidth.
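The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the round-robin model described in this reply, using the figures quoted from the GTX 970 article (8 banks, 1KiB stride, 4-byte-wide banks). The function name `bank_of` and the constants are made up for the example and say nothing about how the K40 actually maps addresses.

```python
# Illustrative model only: the figures are the ones quoted above for the
# GTX 970, not measured or documented values for the K40.
NUM_BANKS = 8          # banks / memory channels
STRIDE_BYTES = 1024    # stride that spans all banks once (1 KiB)
BANK_WIDTH_BYTES = 4   # width of one bank (32 bits)

# Size of one interleaving "word" and the burst length needed to fetch it.
WORD_BYTES = STRIDE_BYTES // NUM_BANKS          # 128 bytes
BURST_LENGTH = WORD_BYTES // BANK_WIDTH_BYTES   # 32 transfers


def bank_of(byte_address: int) -> int:
    """Return the bank a byte address falls in under round-robin interleaving."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


if __name__ == "__main__":
    # Reproduce the word -> bank listing from the post.
    for word in (0, 1, 7, 8, 9):
        print(f"word{word} -> bank {bank_of(word * WORD_BYTES)}")
```

Under these assumptions, any two addresses that are a multiple of 1KiB apart land in the same bank.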

I think you can assume that the L2 organization is the same as the external memory.

I don’t think knowing these details will help speed up your code. I seriously doubt that you can access all 8 banks in parallel from a single warp if that’s what you’re thinking.

As long as your memory accesses are aligned to 128 bytes (for each transaction), then throughput should be good.
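To make the alignment point concrete, here is a rough Python sketch that counts how many distinct 128-byte segments one warp of 32 threads touches. It assumes a simplified model in which every touched 128-byte segment costs one transaction. `segments_touched` and its parameters are hypothetical names for this example, and real hardware behaviour depends on the architecture and cache configuration.

```python
SEGMENT_BYTES = 128   # transaction size assumed in the reply above
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs


def segments_touched(base_address: int,
                     element_bytes: int = 4,
                     stride_elements: int = 1) -> int:
    """Count distinct 128-byte segments accessed by one warp.

    Thread i reads element_bytes bytes starting at
    base_address + i * stride_elements * element_bytes.
    """
    segments = set()
    for lane in range(WARP_SIZE):
        first = base_address + lane * stride_elements * element_bytes
        last = first + element_bytes - 1
        # An unaligned access may straddle a segment boundary.
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    return len(segments)


if __name__ == "__main__":
    print(segments_touched(0))                      # aligned, contiguous 4-byte reads
    print(segments_touched(4))                      # same pattern shifted by 4 bytes
    print(segments_touched(0, stride_elements=32))  # one 4-byte read per segment
```

In this model the aligned contiguous case touches 1 segment, the shifted case touches 2, and the strided case touches 32, which is why aligned, contiguous accesses are the ones to aim for.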

Hi

Thank you for your reply. However, the information you provided does not address my query. I understand the information given on that website; I didn’t ask anyone to reiterate the contents of the link from my first post. That article is about Maxwell. I am looking for the same kind of information (specifically, answers to the questions I listed) for the Kepler architecture and the K40, and for any microbenchmark that would help me understand these details.

Does anyone have an update to this thread for Volta?

Thanks!