In the Kepler tuning guide, section 1.4.3.1 says the following:
“This bandwidth increase is exposed to the application through a configurable new 8-byte shared memory bank mode. When this mode is enabled, 64-bit (8-byte) shared memory accesses (such as loading a double-precision floating point number from shared memory)…”
And then the Pascal tuning guide has, in section 1.4.5.1:
“Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell and Pascal.”
Then in 1.4.5.2:
"[i]Kepler provided an optional 8-byte shared memory banking mode, which had the potential to increase shared memory bandwidth per SM for shared memory accesses of 8 or 16 bytes. However, applications could only benefit from this when storing these larger elements in shared memory (i.e., integers and fp32 values saw no benefit), and only when the developer explicitly opted in to the 8-byte bank mode via the API.
To simplify this, Pascal follows Maxwell in returning to fixed four-byte banks. This allows all applications using shared memory to benefit from the higher bandwidth, without specifying any particular preference via the API.[/i]"
I will need to declare the shared memory space as double to avoid possible overflow during the computation. If I understand section 1.4.5.2 of the Pascal tuning guide correctly, I don't need to specify anything in the API to get 8-byte shared memory accesses; my conclusion is that simply declaring the shared object as double instead of float is enough.
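To be concrete, here is a minimal sketch of what I mean by "just declaring the shared object as double" (the kernel, its name, and the TILE size are placeholders of my own, not anything from the tuning guides):

```cuda
#include <cuda_runtime.h>

#define TILE 256  // placeholder block/tile size for illustration

__global__ void copyAsDouble(const float *in, double *out, int n)
{
    // Each element is 8 bytes wide, i.e. the access size the
    // 8-byte bank mode was designed for.
    __shared__ double tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage through shared memory in double precision to avoid
    // overflow in the intermediate computation.
    if (i < n)
        tile[threadIdx.x] = (double)in[i];
    __syncthreads();
    if (i < n)
        out[i] = tile[threadIdx.x];
}
```

My understanding is that on Maxwell/Pascal this alone gives full-bandwidth 8-byte accesses with no API call, whereas on Kepler the bank mode would additionally need to be selected.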
Finally, let me ask you the following:
- Is this understanding correct?
- If so, this applies to Maxwell and Pascal but not Kepler, which means that on Kepler I would still need to select the configuration in the program. The function cudaFuncSetCacheConfig is mentioned here: [url]https://devblogs.nvidia.com/using-shared-memory-cuda-cc/[/url] , but I can't find it in the API doc: [url]https://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__EXECUTION_1g4f35d04be20a41c5df96613a748eecc1[/url] . Any idea how I could get 8-byte shared memory working in a program that targets Kepler and above?
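For completeness, here is how I currently imagine the Kepler opt-in would look. My assumption (please correct me if wrong) is that the bank size is selected with cudaDeviceSetSharedMemConfig and cudaSharedMemBankSizeEightByte, and that cudaFuncSetCacheConfig controls a different knob (the L1/shared split), not the bank width:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Opt in to 8-byte shared memory banks. Per section 1.4.5.2 of the
    // Pascal tuning guide, this preference only matters on Kepler and
    // should be harmless (ignored) on Maxwell/Pascal.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess)
        printf("cudaDeviceSetSharedMemConfig: %s\n", cudaGetErrorString(err));

    // By contrast, cudaFuncSetCacheConfig selects the L1/shared capacity
    // split, which section 1.4.5.1 says is also ignored on Maxwell/Pascal:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    return 0;
}
```

If that assumption holds, a single cudaDeviceSetSharedMemConfig call at startup would cover Kepler, and newer architectures would simply ignore it.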