Shared memory with compute capability 3.x (in 32-bit mode) or compute capability 5.x and 6.x

Please could someone help clarify?
I have read this in the CUDA documentation regarding shared memory on the aforementioned compute capabilities:

“A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).”

(Programming Guide :: CUDA Toolkit Documentation)

There are then two diagrams, one showing examples of strided access, the other showing examples of memory read accesses that involve the broadcast mechanism.

I am confused by the first diagram, which shows linear addressing with a stride of two 32-bit words (a two-way bank conflict). I don’t understand why threads in the same warp accessing the same bank cause a bank conflict here, when a similar case in the second diagram is handled by the broadcast mechanism without any problem.

Please can anyone help?

Think of shared memory as a 2D array of 32-bit memory cells. The width of this array is 32 columns (not bytes: it is 32 columns of 32-bit memory cells, so 128 bytes wide, but that isn’t the way to think of it). In this context, a column is a bank. If a warp shared read request generates 2 or more addresses from separate threads in the warp, for a given instruction, that fall in the same bank (i.e. belong to the same column in the 2D mental model), then bank conflicts will result.

However, there is an exception. If two or more addresses generated across the warp by a warp shared read instruction in a particular cycle fall in the same column and are all in the same row, this is a special case. In this case, these addresses belong to the same 32-bit cell, using the previous definition for that term. The GPU has a mechanism to service as many reads as required from that single cell, across the warp, in a single cycle. This particular pattern, by itself, does not create bank conflicts, even though technically all the named accesses belong to the same column. This mechanism is referred to as the “broadcast” mechanism. In the stride-two diagram, by contrast, threads 0 and 16 read words 0 and 32: same column, but different rows, so broadcast does not apply and a two-way bank conflict results.
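The rule above can be sketched as a small host-side model. This is only an illustration of the mental model in this post, not a description of the actual hardware; the function name `conflict_degree` and the constants are made up for the example, and it assumes 32 banks of 32-bit words.

```python
from collections import defaultdict

NUM_BANKS = 32    # assumed: 32 banks, as described above
WORD_BYTES = 4    # each bank is one 32-bit cell wide

def conflict_degree(byte_addresses):
    """Largest number of distinct 32-bit cells that fall in one bank.

    1 means conflict-free; n means an n-way bank conflict.
    Addresses inside the same cell count once (the broadcast case).
    """
    cells_per_bank = defaultdict(set)
    for addr in byte_addresses:
        cell = addr // WORD_BYTES                  # index of the 32-bit cell
        cells_per_bank[cell % NUM_BANKS].add(cell)
    return max(len(cells) for cells in cells_per_bank.values())

# One byte address per thread of a 32-thread warp.
stride_1 = [4 * t for t in range(32)]       # thread t reads word t
stride_2 = [4 * 2 * t for t in range(32)]   # thread t reads word 2t
same_word = [4 * 5 for _ in range(32)]      # every thread reads word 5

print(conflict_degree(stride_1))   # 1: every thread hits a different bank
print(conflict_degree(stride_2))   # 2: e.g. words 0 and 32 both land in bank 0
print(conflict_degree(same_word))  # 1: same column and same row, so broadcast
```

The set per bank is what encodes the exception: many threads reading one cell add only one entry, while two different rows of the same column add two.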

Ahh ok, thank you. I didn’t get this clarity from the documentation.
I guess this was because the docs read as if there were only one 32-bit word per bank?
“Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks”
How big is each bank actually?

Shared memory can be thought of as a 2D array, where each column has a width of 32 bits, and there are 32 columns. Therefore, if shared memory happens to have a size of 48 KB, for example, from the programmer’s perspective, then each column is 48 KB / 128 bytes “tall”, i.e. 384 “rows” or elements tall.

Elements 0, 32, 64, 96, etc. belong to bank 0 (or column 0).
Elements 1, 33, 65, 97, etc. belong to bank 1.
Elements 2, 34, 66, 98, etc. belong to bank 2.
...
Elements 31, 63, 95, etc. belong to bank 31.

Each element is 32 bits or 4 bytes “wide”.

An overall “row” of this 2D array is 32x4 = 128 bytes wide.
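As a rough sketch of this layout, the bank and row of a 32-bit element can be computed from its index. The helper names `bank_and_row` and `rows_for` are hypothetical and exist only for this example, which assumes the 32-bank, 4-byte-per-bank layout described above.

```python
NUM_BANKS = 32    # assumed: 32 columns (banks)
WORD_BYTES = 4    # each element is 32 bits wide

def bank_and_row(element_index):
    """Map a 32-bit element index to (bank, row) in the 2D mental model."""
    row, bank = divmod(element_index, NUM_BANKS)
    return bank, row

def rows_for(shared_bytes):
    """Number of rows (elements per bank) for a given shared memory size."""
    return shared_bytes // (NUM_BANKS * WORD_BYTES)

for element in (0, 32, 1, 33, 31, 63):
    print(element, bank_and_row(element))

print(rows_for(48 * 1024))  # 384 rows for 48 KB
```

Elements 0 and 32 come out as bank 0 in rows 0 and 1, which is exactly the same-column, different-row pattern that produces a conflict.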

I see. Thank you very much. I wish it had put it like that in the documentation! :)